# Twitter Analysis - Working with the Data


The .py file corresponding to this code can be found [here](https://github.com/covid-19-impact-lab/twitter_analysis/blob/master/src/data_management/cleaning.py).

In [1]:
data_path = "/home/tm/sciebo/corona/twitter_analysis/src/original_data/corona_data/"

In [2]:
from pathlib import Path

import numpy as np
import pandas as pd
import pyarrow.parquet as pq

In [3]:
UNNECESSARY_COLUMNS = ["formatted_date", "geo"]

In [4]:
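def load_data():
    """Read every parquet file below ``data_path`` into a single DataFrame."""
    paths = list(Path(data_path).glob("**/*.parquet"))

    dfs = []
    for path in paths:
        table = pq.read_table(path)
        df = table.to_pandas()

        # Add state and city from the path: parents[0] is the file's own
        # directory, so the city is two levels and the state three levels above it.
        df["state"] = path.parents[3].name
        df["city"] = path.parents[2].name

        dfs.append(df)

    # Stack all files; sort=False keeps the column order as first encountered.
    df = pd.concat(dfs, sort=False)

    return df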
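
The way `load_data` reads state and city off the path can be checked in isolation. The following is a minimal sketch; the directory layout and file name are assumptions made up for illustration (only the position of the state and city folders relative to the file is implied by the code above).

```python
from pathlib import PurePosixPath

# Hypothetical layout: <root>/<state>/<city>/<level_a>/<level_b>/<file>.parquet
# The two levels between the city folder and the file are assumed here.
path = PurePosixPath("corona_data/Niedersachsen/Hannover/2020/04/tweets.parquet")

# parents[0] is the file's directory, parents[1] one level up, and so on.
state = path.parents[3].name
city = path.parents[2].name

print(state, city)
```

If the files sat deeper or shallower than this, the indices `3` and `2` would have to change accordingly.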

In [5]:
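def minimal_preprocessing(df):
    """Unify missing values, drop duplicate ids and unused columns, index by id."""
    # Treat None and empty strings as missing values.
    replace_to = {None: np.nan, "": np.nan}

    df = df.replace(replace_to)

    # Keep only the first row for each id.
    df = df.drop_duplicates(subset="id")

    df = df.drop(columns=UNNECESSARY_COLUMNS)

    # Store ids as unsigned 64-bit integers and use them as the index.
    df["id"] = df["id"].astype(np.uint64)
    df = df.set_index("id")

    return df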

In [6]:
df = load_data()

In [7]:
df = minimal_preprocessing(df)
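
To see what the preprocessing does, it can help to run the same steps on a tiny table. The sketch below mirrors the steps of `minimal_preprocessing` on made-up data; the values are invented for illustration, and it only covers empty strings, not `None`.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the raw data: id 1 appears twice, one text is empty.
raw = pd.DataFrame(
    {
        "id": [3, 1, 1, 2],
        "text": ["a", "b", "b", ""],
        "geo": ["", "", "", ""],
        "formatted_date": ["x", "y", "y", "z"],
    }
)

clean = raw.replace("", np.nan)                        # empty strings become missing
clean = clean.drop_duplicates(subset="id")             # keep the first row per id
clean = clean.drop(columns=["formatted_date", "geo"])  # drop the unused columns
clean["id"] = clean["id"].astype(np.uint64)
clean = clean.set_index("id")

print(clean)
```

The result has one row per id, a single `text` column with one missing value, and the id as index.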

The data set is already quite large and will keep growing over time. For a quick exploratory analysis we recommend working with a random subset of the data.

In [8]:
df.shape

(191603, 14)

In [9]:
dff = df.sample(1000, random_state=1)
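
Fixing `random_state` makes the subset reproducible across runs. A small sketch on a made-up frame (the name `toy` and its contents are only for illustration) also shows `frac` as an alternative to a fixed row count, which may be convenient while the data keeps growing:

```python
import numpy as np
import pandas as pd

# Made-up frame standing in for the tweet data.
toy = pd.DataFrame({"value": np.arange(100)})

# The same seed yields the same rows on every call.
first = toy.sample(10, random_state=1)
second = toy.sample(10, random_state=1)

# frac draws a share of the rows instead of a fixed number.
tenth = toy.sample(frac=0.1, random_state=1)

print(first.equals(second), len(tenth))
```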

In [10]:
dff.columns

Index(['username', 'to', 'text', 'retweets', 'favorites', 'replies',
       'permalink', 'author_id', 'date', 'hashtags', 'mentions', 'urls',
       'state', 'city'],
      dtype='object')

In [11]:
dff[["text", "date", "hashtags", "state", "city"]].reset_index(drop=True).head(20)

| | text | date | hashtags | state | city |
|---|---|---|---|---|---|
| 0 | Besonderer Dank an @stephanschmidt, @sgpqr, @P... | 2020-04-05 16:34:37+00:00 | #Corona | Berlin | Berlin |
| 1 | Fluid dynamics work hints at whether spoken wo... | 2020-04-09 20:43:47+00:00 | | Baden-Wuerttemberg | Heidelberg |
| 2 | #Coronavirus I Bußgeldkatalog beschlossen Vers... | 2020-04-09 14:40:27+00:00 | #Coronavirus | Niedersachsen | Hannover |
| 3 | In Baden-Württemberg gelten an #Ostern die gle... | 2020-04-09 08:08:08+00:00 | #Ostern #Feiertage #Corona | Baden-Wuerttemberg | Heidelberg |
| 4 | Vor zwei Wochen hatte sich die Kanzlerin selbs... | 2020-04-03 15:27:19+00:00 | #Quarant #Merkel #Corona | Niedersachsen | Goettingen |
| 5 | Ja, es ist total hilfreich Corona als harmlose... | 2020-04-14 08:18:34+00:00 | | Nordrhein-Westfalen | Nordrhein-Westfalen |
| 6 | Is this during corona? | 2020-04-08 20:04:18+00:00 | | Berlin | Berlin |
| 7 | Neue Therapie kann schwerkranken Corona-Patien... | 2020-04-08 04:49:14+00:00 | | Berlin | Berlin |
| 8 | Fraglich ob das ein effizientes Nutzen der akt... | 2020-04-03 05:15:46+00:00 | | Nordrhein-Westfalen | Koeln |
| 9 | “Wir sind in einer Existenzkrise”: Grüne-Jugen... | 2020-04-05 14:54:52+00:00 | #Corona | Niedersachsen | Goettingen |
