# D-Drivers Capstone project (file of Thomas)

## Content

1. Prepare Environment
2. Load data
3. EDA + first Insights + Preprocessing + further Insights
4. Conclusion

# 1. Prepare Environment

- create a virtual environment from `requirements_dev.txt`
- add `data/` and `.DS_Store` to `.gitignore`

<details><summary>
Click here for details...
</summary>

```bash
pyenv local 3.11.3
python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements_dev.txt
```
</details>


In [88]:
import pandas as pd

# Charting
import plotly.express as px

# 2. Load data

In [89]:
data_file = 'discover_2024-03-26.xlsx'
file_path = '../data/' + data_file
file_path

'../data/discover_2024-03-26.xlsx'

In [90]:
df = pd.read_excel(file_path, sheet_name='data')

# 3. EDA + first insights

In [91]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 132846 entries, 0 to 132845
Data columns (total 17 columns):
 #   Column                      Non-Null Count   Dtype         
---  ------                      --------------   -----         
 0   PAGE_EFAHRER_ID             132846 non-null  int64         
 1   DATE                        132846 non-null  datetime64[ns]
 2   PUBLISHED_AT                42111 non-null   datetime64[ns]
 3   PUBLISH_DATE_EQUAL_TO_DATE  132846 non-null  object        
 4   PAGE_CANONICAL_URL          132846 non-null  object        
 5   PAGE_NAME                   132846 non-null  object        
 6   CLASSIFICATION_PRODUCT      132191 non-null  object        
 7   CLASSIFICATION_TYPE         132191 non-null  object        
 8   TITLE                       132846 non-null  object        
 9   PAGE_AUTHOR                 132846 non-null  object        
 10  DAILY_LIKES                 33623 non-null   float64       
 11  DAILY_DISLIKES              27291 non-n

In [92]:
df.isna().sum()

PAGE_EFAHRER_ID                    0
DATE                               0
PUBLISHED_AT                   90735
PUBLISH_DATE_EQUAL_TO_DATE         0
PAGE_CANONICAL_URL                 0
PAGE_NAME                          0
CLASSIFICATION_PRODUCT           655
CLASSIFICATION_TYPE              655
TITLE                              0
PAGE_AUTHOR                        0
DAILY_LIKES                    99223
DAILY_DISLIKES                105555
WORD_COUNT                     91207
VIDEO_PLAY                       776
IMPRESSIONS                      776
DISCOVER_CLICKS                  776
DISCOVER_IMPRESSIONS             776
dtype: int64
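The absolute counts above are easier to compare as shares: with 132846 rows, roughly 68% of `PUBLISHED_AT` and 79% of `DAILY_DISLIKES` are missing. The following is a minimal sketch of that calculation on a made-up toy frame (the values are illustrative, not taken from the data file); on the real data the same expression would be applied to `df`.

```python
import pandas as pd

# Toy frame standing in for the real data (values are made up).
toy = pd.DataFrame({
    "PAGE_EFAHRER_ID": [1, 2, 3, 4],
    "DAILY_LIKES": [17.0, None, None, None],
    "WORD_COUNT": [1513.0, None, 800.0, None],
})

# Share of missing values per column in percent, largest first.
missing_pct = (toy.isna().mean() * 100).sort_values(ascending=False)
print(missing_pct.round(1))
```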

In [93]:
# keep raw data for later
df_raw = df.copy()

In [94]:
df['PAGE_EFAHRER_ID'].unique().size

6899

--> Use the VS Code extension "Data Wrangler (Preview)" to explore the data visually: click "Open 'df'" below the `df.head()` output.

In [95]:
df.head()

,PAGE_EFAHRER_ID,DATE,PUBLISHED_AT,PUBLISH_DATE_EQUAL_TO_DATE,PAGE_CANONICAL_URL,PAGE_NAME,CLASSIFICATION_PRODUCT,CLASSIFICATION_TYPE,TITLE,PAGE_AUTHOR,DAILY_LIKES,DAILY_DISLIKES,WORD_COUNT,VIDEO_PLAY,IMPRESSIONS,DISCOVER_CLICKS,DISCOVER_IMPRESSIONS
0,1010803,2023-01-02,NaT,N,https://efahrer.chip.de/news/tariferhoehungen-...,efa-1010803 | Tariferhöhungen und THG-Prämie: ...,THG,News,Tariferhöhungen und THG-Prämie: Ladesäulenbet...,Karl Lüdecke,,,,1261.0,1375.0,1301.0,20323.0
1,1010592,2023-01-02,NaT,N,https://efahrer.chip.de/news/das-logo-von-alfa...,efa-1010592 | Alfa Romeo: Was bedeuten Schlang...,Auto,News,Alfa Romeo: Was bedeuten Schlange und Kreuz?,Karl Müller,,,,286.0,298.0,164.0,1493.0
2,1010719,2023-01-05,NaT,N,https://efahrer.chip.de/news/titel-ist-zurueck...,efa-1010719 | Rennen um die effizienteste Sola...,Solaranlagen,News,Rennen um die effizienteste Solarzelle: Deuts...,Aslan Berse,,,,156.0,300.0,303.0,4912.0
3,1010727,2023-01-05,NaT,N,https://efahrer.chip.de/news/entlastungen-fuer...,efa-1010727 | Antrag stellen oder leer ausgehe...,Energie,Ratgeber,Antrag stellen oder leer ausgehen: Diese Entl...,CHIP,,,,16.0,55.0,14009.0,92422.0
4,1010557,2023-01-02,2023-01-02,Y,https://efahrer.chip.de/news/solaranlage-auch-...,efa-1010557 | Balkonkraftwerk kaufen: Das sind...,Balkonkraftwerk,Kaufberatung,Balkonkraftwerk kaufen: Das sind die besten M...,Eva Goldschald,17.0,1.0,1513.0,174.0,128.0,6494.0,114984.0


## 3.1 DRAFT charts

In [96]:
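# Needs the lowercase columns and the 'year' column created in section 3.2;
# running this cell before preprocessing raises the KeyError shown below.
df_filtered = df[df['year'] == 2024]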

KeyError: 'year'
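A minimal sketch of filtering by year without a helper column, assuming `DATE` is still `datetime64[ns]` as reported by `df.info()`. The toy frame and its values are made up for illustration.

```python
import pandas as pd

# Toy frame with the original uppercase column names (values are made up).
toy = pd.DataFrame({
    "DATE": pd.to_datetime(["2023-12-31", "2024-01-01", "2024-03-26"]),
    "DISCOVER_CLICKS": [10.0, 20.0, 30.0],
})

# Take the year straight from the datetime column; no 'year' column required.
toy_2024 = toy[toy["DATE"].dt.year == 2024]
print(toy_2024)
```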

In [None]:
# yearly_counts = df.groupby(['yr_built', 'historic_house']).size().reset_index(name='counts')
fig = px.bar(df_filtered, 
             x="date", 
             y="discover_clicks", 
             # text_auto=True, 
             color='is_weekend',
             title="clicks by day")
fig.update_layout(
    xaxis_title='date',
    yaxis_title="clicks"
)
fig.show()
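The data holds one row per page and day, so a bar chart built on the unaggregated frame draws one bar segment per row. A possible alternative, sketched here on made-up toy data, is to sum the clicks per day first. The column names follow the lowercase names from section 3.2; the values are illustrative only.

```python
import datetime as dt

import pandas as pd

# Toy frame: two pages on a Friday, one page on a Saturday (values are made up).
toy = pd.DataFrame({
    "date": [dt.date(2024, 1, 5), dt.date(2024, 1, 5), dt.date(2024, 1, 6)],
    "is_weekend": [False, False, True],
    "discover_clicks": [100.0, 50.0, 30.0],
})

# One row per day; is_weekend is kept in the key so it can still drive the colour.
daily = toy.groupby(["date", "is_weekend"], as_index=False)["discover_clicks"].sum()
print(daily)
```

The aggregated frame could then be passed to `px.bar` in place of `df_filtered`.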

## 3.2 Preprocessing
- lowercase all column names
- add date features (year, month, day of week, is_weekend)
- mask target columns
- how to deal with NULL values in likes and dislikes?

In [None]:
columns = [col.lower() for col in df.columns]
df.columns = columns

In [None]:
# add column day_of_week (Monday=0, Sunday=6)
df['day_of_week'] = df['date'].dt.dayofweek

# add column is_weekend (True for Sat. and Sun., i.e. day_of_week 5 and 6)
df['is_weekend'] = df['day_of_week'] >= 5

# add column year
df['year'] = df['date'].dt.year

# add column month
df['month'] = df['date'].dt.month

# convert the datetime column to plain dates (done last, after the .dt features)
df['date'] = df['date'].dt.date
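The preprocessing list leaves open how to treat NULL values in `daily_likes` and `daily_dislikes`. One possible option, sketched below and not a decision taken in this notebook, is to record where a value was missing and then fill with 0. The `*_missing` column names are hypothetical, and reading "missing" as "no reactions recorded" is an assumption that would need to be checked against how the data was exported.

```python
import pandas as pd

# Toy frame (values are made up).
toy = pd.DataFrame({
    "daily_likes": [17.0, None, 3.0],
    "daily_dislikes": [1.0, None, None],
})

for col in ["daily_likes", "daily_dislikes"]:
    # Hypothetical indicator column: True where the original value was missing.
    toy[f"{col}_missing"] = toy[col].isna()
    # Assumption: a missing value means no reactions were recorded that day.
    toy[col] = toy[col].fillna(0)

print(toy)
```

Keeping the indicator columns means the fill can be undone or modelled separately later.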

## 3.3 EDA with whole team


In [None]:
counts = df.groupby(['page_efahrer_id', 'date', 'title', 'page_name', 'page_canonical_url', 'classification_type', 'classification_product', 'page_author']).count().describe()
# counts.query('publish_date_equal_to_date > 1')
counts

,published_at,publish_date_equal_to_date,daily_likes,daily_dislikes,word_count,video_play,clickouts,discover_clicks,discover_impressions,day_of_week,is_weekend,datetime
count,131890.0,131890.0,131890.0,131890.0,131890.0,131890.0,131890.0,131890.0,131890.0,131890.0,131890.0,131890.0
mean,0.318045,1.000129,0.254159,0.206286,0.314467,0.994283,0.994283,0.994283,0.994283,1.000129,1.000129,1.000129
std,0.465979,0.011353,0.435633,0.404845,0.464566,0.076986,0.076986,0.076986,0.076986,0.011353,0.011353,0.011353
min,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0
25%,0.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
50%,0.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
75%,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
max,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0
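A `max` of 2 in the summary above means that at least one key combination occurs in two rows. The sketch below shows one way to list such rows directly with `DataFrame.duplicated`. The toy data and the shortened key (`page_efahrer_id`, `date`) are assumptions for illustration. Note also that `groupby` drops rows with NaN in a key column by default, whereas `duplicated` does not.

```python
import pandas as pd

# Toy frame: page 1037 appears twice on the same day (values are made up).
toy = pd.DataFrame({
    "page_efahrer_id": [1037, 1037, 1039, 1039],
    "date": ["2023-01-02", "2023-01-02", "2023-01-02", "2023-01-03"],
    "discover_clicks": [10.0, 12.0, 5.0, 7.0],
})

# keep=False marks every member of a duplicated group, not only the repeats.
key_cols = ["page_efahrer_id", "date"]
dupes = toy[toy.duplicated(subset=key_cols, keep=False)]
print(dupes)
```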


In [None]:
df['page_name'].value_counts()

page_name
efa-1014301 | Diese PV-Anlage kommt in die Steckdose: Leistung übertrifft Balkonkraftwerke           566
efa-105259 | E-Auto für 65 Euro: Hier kommt der Billig-Deal für den Elektro-Dacia                    561
efa-1012150 | Komplette Solaranlage für 10.000 Euro: Das müssen Sie über den Deal wissen             507
efa-109751 | Elektrischer China-Golf im Leasing: Der MG MG4 kostet unter 200 Euro pro Monat          490
efa-1012593 | Cupra Born für Privatkunden mit garantierte BAFA-Prämie bei Bestellung bis 19.05       488
                                                                                                    ... 
efa-1017558 | Können Verbraucher darauf hoffen? Das taugt Biomethan für die Heizung in Zukunft         1
efa-109736 | Was passiert, wenn E-Autos ins Wasser fallen? Tesla-Funktion überrascht                   1
efa-1011434 | Nachhaltig bauen: So gelingt die ressourcenschonende Planung Ihres Eigenheims            1
efa-1017567 | Deutsche Stadt sucht Flächen f

In [None]:
df[["page_efahrer_id", "page_name"]].drop_duplicates().groupby("page_name").count()

page_name,page_efahrer_id
efa-1010012 | Hersteller reduziert fast alle Solargeneratoren: Diese können wir empfehlen,1
efa-1010022 | Beliebter Irrglaube: Warum PV-Anlagen beim Blackout auch keinen Strom liefern,1
efa-1010045 | Kleiner Aufkleber auf dem Rad: Diese Zahl ist beim E-Bike lebenswichtig,1
efa-1010048 | Kann Solarstrom für den Winter speichern: Deutscher zeigt geniale Erfindung,1
efa-1010057 | Dreimal schneller zu Fuß: Genialer Elektro-Schuh hat mehr Power als E-Bike,1
...,...
efa-109925 | E-Bikes und E-Roller bei Hagebau.de,1
efa-109934 | E-Bike-Umbau-Kit soll in 30 Sekunden installiert sein: Der Preis ist der Hammer,1
"efa-109968 | Tschüss, Photovoltaik? Diese Solaranlage erzeugt Wasserstoff statt Strom",1
efa-109971 | Top 10: Diese E-Bike-Marken fahren die Deutschen am liebsten,1


In [None]:
df.groupby("page_efahrer_id").count()

page_efahrer_id,date,published_at,publish_date_equal_to_date,page_canonical_url,page_name,classification_product,classification_type,title,page_author,daily_likes,daily_dislikes,word_count,video_play,clickouts,discover_clicks,discover_impressions,day_of_week,is_weekend,datetime
1037,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7
1039,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2
1040,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10
10245,3,0,3,3,3,3,3,3,3,0,0,0,3,3,3,3,3,3,3
10273,26,25,26,26,26,26,26,26,26,25,25,25,26,26,26,26,26,26,26
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1018763,1,0,1,1,1,0,0,1,1,0,0,0,1,1,1,1,1,1,1
1018764,1,0,1,1,1,0,0,1,1,0,0,0,1,1,1,1,1,1,1
1018766,1,0,1,1,1,0,0,1,1,0,0,0,1,1,1,1,1,1,1
1018767,1,0,1,1,1,0,0,1,1,0,0,0,1,1,1,1,1,1,1


In [None]:
df['page_efahrer_id'].value_counts()

page_efahrer_id
1014301    566
105259     561
1012150    507
109751     490
1012593    488
          ... 
1011281      1
1017558      1
109736       1
1011434      1
1018511      1
Name: count, Length: 6891, dtype: int64

In [None]:
df[["page_efahrer_id", "page_name"]].drop_duplicates().groupby("page_name").count().max()

page_efahrer_id    1
dtype: int64
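The check above confirms that every `page_name` maps to exactly one `page_efahrer_id`. The reverse direction (one ID carrying several names, for example after a title change) is not covered by it. A minimal sketch of that reverse check with `nunique`, on made-up toy data:

```python
import pandas as pd

# Toy frame: ID 1037 appears under two different names (values are made up).
toy = pd.DataFrame({
    "page_efahrer_id": [1037, 1037, 1039],
    "page_name": ["efa-1037 | old title", "efa-1037 | new title", "efa-1039 | title"],
})

# Number of distinct names per ID; anything above 1 indicates a renamed page.
names_per_id = toy.groupby("page_efahrer_id")["page_name"].nunique()
multi_named = names_per_id[names_per_id > 1]
print(multi_named)
```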