# Data Understanding

In this notebook we take a first look at the data. We inspect its shape, distributions, and missing values to get an initial impression and a first idea of what we can do with it.

Afterwards, we clean and prepare the data in [02_preparation.ipynb](02_preparation.ipynb) and plot it in [03_plotting.ipynb](03_plotting.ipynb).


## Config and Imports


In [None]:
from pathlib import Path
import json

import numpy as np

from studienarbeit.utils.load import EDataTypes, Load

In [None]:
file_name = "prep_tweets_sent_full.parquet"
data_type = EDataTypes.TWEETS

load = Load(data_type=data_type)

In [None]:
df = load.load_dataframe(
    "tweets.parquet", columns=["screen_name", "created_at", "is_retweet", "text", "party", "birthyear", "gender"]
)

## Analyse the Data


In [None]:
# Print the initial shape of the dataframe
df.shape

In [None]:
# Replace placeholder strings left over from the R-to-Python conversion (empty strings and "NA" lists) with None
for col in df.columns:
    df[col] = df[col].apply(
        lambda x: None if x == "" or x == "NA" or x == "NA, NA" or x == "NA, NA, NA, NA, NA, NA, NA, NA" else x
    )
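
The same normalisation can also be written without a per-cell lambda. The following is a minimal sketch on a hypothetical toy frame (the `toy` data and the `NA_MARKERS` name are assumptions for illustration, not part of the project), using `isin` and `mask` to turn every placeholder into a missing value:

```python
import pandas as pd

# Hypothetical toy frame that mimics the placeholder strings from the R export
toy = pd.DataFrame(
    {
        "text": ["hello", "", "NA"],
        "party": ["SPD", "NA, NA", "CDU"],
    }
)

# The placeholder strings handled in the cell above
NA_MARKERS = ["", "NA", "NA, NA", "NA, NA, NA, NA, NA, NA, NA, NA"]

# isin() flags every cell that equals one of the markers;
# mask() replaces the flagged cells with a missing value
cleaned = toy.mask(toy.isin(NA_MARKERS))

# Count the missing values per column after the replacement
na_counts = cleaned.isna().sum()
print(na_counts)
```

This variant works on the whole frame at once, which tends to be faster than `apply` on large data, but it is only an alternative sketch and not the code used in this notebook.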

In [None]:
# Print the count and distribution before any preprocessing
print(f"Shape before dropping na: {df[df['party'] != 'Parteilos'].shape}")
print(f"\nParty distribution before preprocessing: \n{df[df['party'] != 'Parteilos']['party'].value_counts()}")

In [None]:
# Count the number of retweets per party
df[df["is_retweet"] == "TRUE"]["party"].value_counts()

In [None]:
# Describe the distribution of tweets per screen_name (count, mean, quartiles), grouped by party
df.groupby("party")["screen_name"].value_counts().groupby("party").describe()
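
To make the chained call above easier to follow, here is a small sketch that splits it into its two steps. The toy data and the variable names `tweets_per_user` and `summary` are assumptions for illustration only:

```python
import pandas as pd

# Hypothetical toy data: party "A" has two users, party "B" has one
toy = pd.DataFrame(
    {
        "party": ["A", "A", "A", "B", "B", "B", "B"],
        "screen_name": ["u1", "u1", "u2", "u3", "u3", "u3", "u3"],
    }
)

# Step 1: number of tweets per (party, screen_name) pair
tweets_per_user = toy.groupby("party")["screen_name"].value_counts()

# Step 2: summary statistics of those per-user counts, one row per party
summary = tweets_per_user.groupby("party").describe()
print(summary[["count", "mean", "min", "max"]])
```

In the summary, `count` is the number of users per party and `mean` is the average number of tweets per user in that party.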

In [None]:
# Count the number of unique users
df["screen_name"].nunique()

In [None]:
# Check for missing values
df.isna().sum()

The cell above shows about 11k missing values in the `text` column and about 3k in the `is_retweet` column.

In the following cell we drop these rows.


In [None]:
df = df.dropna(subset=["text", "is_retweet"])

In [None]:
# Check how many unique values are in the columns
df.nunique()

In [None]:
# Drop duplicated rows (some tweets seem to have been scraped twice on different days)
df = df.drop_duplicates(
    subset=["screen_name", "is_retweet", "text", "party", "birthyear", "gender"], keep="last"
)
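
The effect of `subset` and `keep="last"` can be sketched on a hypothetical toy frame (the data below is made up for illustration). Rows that agree on all subset columns count as duplicates even if they differ in a column outside the subset, and only the last of them is kept:

```python
import pandas as pd

# Hypothetical toy data: the first two rows agree on the subset columns
# but differ in `created_at`, which is not part of the subset
toy = pd.DataFrame(
    {
        "screen_name": ["u1", "u1", "u2"],
        "created_at": ["2021-01-01", "2021-01-02", "2021-01-01"],
        "text": ["same tweet", "same tweet", "other tweet"],
    }
)

# Only the subset columns decide what counts as a duplicate;
# keep="last" retains the last occurrence of each duplicate group
deduped = toy.drop_duplicates(subset=["screen_name", "text"], keep="last")
print(deduped)
```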

In [None]:
# Party distribution after dropping missing values and duplicates
df["party"].value_counts()

In [None]:
# Number of unique users per gender
df.groupby("gender")["screen_name"].nunique()

In [None]:
convert_dict = {
    "screen_name": "string[pyarrow]",
    "created_at": "datetime64[ns]",
    "is_retweet": "category",
    "text": "string[pyarrow]",
    "party": "category",
    "birthyear": "datetime64[ns]",
    "gender": "category",
}

In [None]:
df = df.astype(convert_dict)
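
Whether `astype("datetime64[ns]")` yields the intended dates for `birthyear` depends on how the years are stored: year strings are parsed as dates, whereas plain integers would be read as nanoseconds since the epoch. As a hedged alternative, the sketch below assumes the years are stored as strings (an assumption, with made-up toy data) and parses them explicitly with `pd.to_datetime` and a `%Y` format:

```python
import pandas as pd

# Hypothetical toy data; we assume birth years are stored as year strings
toy = pd.DataFrame(
    {
        "party": ["SPD", "CDU", "SPD"],
        "birthyear": ["1970", None, "1985"],
    }
)

# Low-cardinality columns can be stored compactly as categories
toy["party"] = toy["party"].astype("category")

# Parse the year explicitly; missing values become NaT
toy["birthyear"] = pd.to_datetime(toy["birthyear"], format="%Y")

print(toy.dtypes)
print(toy)
```

Each parsed year is represented as January 1 of that year, so only the year component carries information.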

In [None]:
df.info(verbose=True, memory_usage="deep")

In [None]:
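# Note: `datetime_is_numeric` only exists in pandas 1.1 to 1.5; it was removed in pandas 2.0,
# where datetime columns are always summarized numerically, so drop the argument there
df.describe(include="all", datetime_is_numeric=True)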

In [None]:
df.head()

In [None]:
load.save_dataframe(df, "tweets_understanding.parquet")

In [None]:
from ydata_profiling import ProfileReport

In [None]:
profile = ProfileReport(df, title="Profiling Report")

In [None]:
profile.to_file("tweets.html")