# Data Preparation
**Author**: Andrea Cass

## 1. About this notebook
The purpose of this notebook is to make some final preparations before the data can be explored and visualized. The data used is:
> *03_Sentiment-analysis_limited_merged.csv*

Goals:
* Separate the set of probability scores into individual columns (Negative, Neutral, Positive)
* Create a new categorical column, CLASS, based on which sentiment classification has the largest probability score
* Create a new numerical column, num_CLASS, based on class (-1 for NEGATIVE, 0 for NEUTRAL, 1 for POSITIVE)

The output will be a single dataset saved as a CSV file titled:
> *04_Prepared-data_limited_merged.csv*

## 2. Imports

In [13]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import csv
import os
from pathlib import Path

## 3. Working directory & file paths

Before any data preparation can begin, the working directory needs to be set up. You should have already created a folder called "CASS_thesis" within your desired working directory.

Two objects will be named:

* **cwd**: the current working directory (e.g., your Desktop)
* **CASS_thesis**: the folder where all data from my Notebooks will be saved

### 3.1. Current working directory
Use the code below to find out what your current working directory is set to.

In [14]:
# find current working directory

os.getcwd()

'/Users/andycass/Desktop/Thesis_data-and-code'

If your current working directory is not your desired directory, change it by:
1. deciding where you would like your working directory to be (e.g., your Desktop)
2. entering the file path of your desired working directory into the code below

**NOTE**: If you are satisfied with your working directory and do NOT wish to change it, skip the block of code underneath **3.1.1. Changing current working directory** and, instead, proceed from the block of code underneath **3.1.2. Naming current working directory**.

#### 3.1.1. Changing current working directory
**NOTE**: The code below contains the path to **my** desired working directory to serve as an example. You must alter it to the path of **your** desired working directory. Keep in mind that my example follows macOS path conventions; Windows paths are formatted differently.

In [15]:
# changing current working directory

os.chdir('/Users/andycass/Desktop/Thesis_data-and-code')
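For readers on Windows, the following is a minimal sketch of how the formatting differs. The Windows path shown is hypothetical and only illustrates the drive letter and backslashes; pathlib's pure path classes are used so that the snippet runs on any operating system without touching the file system.

```python
from pathlib import PurePosixPath, PureWindowsPath

# macOS/Linux style, as used in this notebook
mac_path = PurePosixPath('/Users/andycass/Desktop/Thesis_data-and-code')

# Hypothetical Windows equivalent: drive letter plus backslashes.
# The raw string (r'...') stops Python from treating backslashes as escapes.
win_path = PureWindowsPath(r'C:\Users\andycass\Desktop\Thesis_data-and-code')

print(mac_path.name)        # final folder name of the macOS path
print(win_path.name)        # final folder name of the Windows path
print(win_path.as_posix())  # the Windows path written with forward slashes
```

On Windows, either the raw-string form or the forward-slash form can be passed to os.chdir.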

#### 3.1.2. Naming current working directory
Now that your current working directory is established, use the code below to name it "cwd":

In [16]:
# naming the current working directory

cwd = Path.cwd()

In [17]:
# double-checking the current working directory location

cwd

PosixPath('/Users/andycass/Desktop/Thesis_data-and-code')

### 3.2. CASS_thesis

In [18]:
# naming the CASS_thesis folder

CASS_thesis = cwd / 'CASS_thesis'

In [19]:
# double-checking the CASS_thesis location

CASS_thesis

PosixPath('/Users/andycass/Desktop/Thesis_data-and-code/CASS_thesis')
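Since the notebook assumes that the CASS_thesis folder already exists, a quick existence check can catch a wrong working directory early. The sketch below uses a temporary directory as a stand-in for cwd so that it can run anywhere; in the notebook, cwd would take its place. Creating the folder when it is missing is an optional extra, not something the original workflow does.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)               # stand-in for cwd (assumption for this demo)
    folder = base / 'CASS_thesis'

    existed_before = folder.is_dir()            # False: nothing created yet
    folder.mkdir(parents=True, exist_ok=True)   # create it if it is missing
    exists_after = folder.is_dir()              # True once the folder exists

print(existed_before, exists_after)
```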

## 4. Separating scores
### 4.1. Loading the data

In [20]:
df = pd.read_csv(CASS_thesis / '03_Sentiment-analysis_limited_merged.csv', index_col=[0])


### 4.2. Viewing the dataframe

In [21]:
df

Unnamed: 0,text,author_id,created_at,lang,geo.place_id,public_metrics.retweet_count,public_metrics.reply_count,public_metrics.like_count,public_metrics.quote_count,public_metrics.impression_count,...,week,month,year,year-week,year-month,Language,date,inflow,new text,scores
0,"Syrisch-orthodoxer Bischof: ""Was im Nahen Oste...",4.122038e+09,2016-04-20 22:55:08+00:00,de,06d9a7c249c59bcd,0.0,0.0,0.0,0.0,0.0,...,16.0,4.0,2016.0,2016-16,2016-04,German,2016-04-20 00:00:00+00:00,Syrians,"Syrisch-orthodoxer Bischof: ""Was im Nahen Oste...",[0.0679844 0.88380396 0.0482117 ]
1,"Habe schon lang nicht gehört, daß Flüchtling G...",1.179544e+09,2016-04-20 21:27:37+00:00,de,e99b714fe65be4fb,0.0,0.0,0.0,0.0,0.0,...,16.0,4.0,2016.0,2016-16,2016-04,German,2016-04-20 00:00:00+00:00,Syrians,"Habe schon lang nicht gehört, daß Flüchtling G...",[0.3660313 0.5833739 0.05059479]
2,"""Es kommen kaum noch Flüchtlinge nach Griechen...",2.246076e+08,2016-04-20 21:18:58+00:00,de,3078869807f9dd36,0.0,0.0,0.0,0.0,0.0,...,16.0,4.0,2016.0,2016-16,2016-04,German,2016-04-20 00:00:00+00:00,Syrians,"""Es kommen kaum noch Flüchtlinge nach Griechen...",[0.30297568 0.4430802 0.25394407]
3,Unsere 1. Kochshow für #Flüchtlinge. Super spi...,2.480764e+09,2016-04-20 18:25:11+00:00,de,8abc99434d4f5d28,0.0,0.0,4.0,0.0,0.0,...,16.0,4.0,2016.0,2016-16,2016-04,German,2016-04-20 00:00:00+00:00,Syrians,Unsere 1. Kochshow für #Flüchtlinge. Super spi...,[0.02260253 0.07604674 0.9013506 ]
4,500 tote #Flüchtlinge im #Mittelmeer – Tragödi...,6.062653e+08,2016-04-20 16:27:28+00:00,de,e11a8b8e3771f9fa,0.0,1.0,0.0,0.0,0.0,...,16.0,4.0,2016.0,2016-16,2016-04,German,2016-04-20 00:00:00+00:00,Syrians,500 tote #Flüchtlinge im #Mittelmeer – Tragödi...,[0.6982864 0.27545568 0.02625783]
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
66260,"For day 1 of week 2, @AnnaMariaKonsta discusse...",1.104025e+08,2021-06-28 14:43:31+00:00,en,fcbb3c6e0a7eba22,0.0,1.0,2.0,0.0,0.0,...,26.0,6.0,2021.0,2021-26,2021-06,English,2021-06-28 00:00:00+00:00,Ukrainians,"For day 1 of week 2, @user discusses the Socia...",[0.03550766 0.82208896 0.14240335]
66261,"@ariadneconill Europe is racist, but in a diff...",2.521809e+09,2021-06-27 13:03:29+00:00,en,e385d4d639c6a423,0.0,1.0,6.0,0.0,0.0,...,25.0,6.0,2021.0,2021-26,2021-06,English,2021-06-27 00:00:00+00:00,Ukrainians,"@user Europe is racist, but in a different way...",[0.8931986 0.09380516 0.01299612]
66262,"A labour of love, inspired by Middle-earth.\n\...",5.633818e+08,2021-06-27 08:37:21+00:00,en,257640324f249a73,0.0,1.0,17.0,0.0,0.0,...,25.0,6.0,2021.0,2021-26,2021-06,English,2021-06-27 00:00:00+00:00,Ukrainians,"A labour of love, inspired by Middle-earth.\n\...",[0.15655968 0.5201203 0.32332003]
66263,@simongerman600 I must have missed the great f...,2.591892e+09,2021-06-26 08:03:22+00:00,en,000b71538f35fe46,0.0,0.0,1.0,0.0,0.0,...,25.0,6.0,2021.0,2021-25,2021-06,English,2021-06-26 00:00:00+00:00,Ukrainians,@user I must have missed the great flight of t...,[0.91320664 0.06375945 0.02303378]


In [22]:
# resetting index

df = df.reset_index(drop=True)

In [23]:
df

Unnamed: 0,text,author_id,created_at,lang,geo.place_id,public_metrics.retweet_count,public_metrics.reply_count,public_metrics.like_count,public_metrics.quote_count,public_metrics.impression_count,...,week,month,year,year-week,year-month,Language,date,inflow,new text,scores
0,"Syrisch-orthodoxer Bischof: ""Was im Nahen Oste...",4.122038e+09,2016-04-20 22:55:08+00:00,de,06d9a7c249c59bcd,0.0,0.0,0.0,0.0,0.0,...,16.0,4.0,2016.0,2016-16,2016-04,German,2016-04-20 00:00:00+00:00,Syrians,"Syrisch-orthodoxer Bischof: ""Was im Nahen Oste...",[0.0679844 0.88380396 0.0482117 ]
1,"Habe schon lang nicht gehört, daß Flüchtling G...",1.179544e+09,2016-04-20 21:27:37+00:00,de,e99b714fe65be4fb,0.0,0.0,0.0,0.0,0.0,...,16.0,4.0,2016.0,2016-16,2016-04,German,2016-04-20 00:00:00+00:00,Syrians,"Habe schon lang nicht gehört, daß Flüchtling G...",[0.3660313 0.5833739 0.05059479]
2,"""Es kommen kaum noch Flüchtlinge nach Griechen...",2.246076e+08,2016-04-20 21:18:58+00:00,de,3078869807f9dd36,0.0,0.0,0.0,0.0,0.0,...,16.0,4.0,2016.0,2016-16,2016-04,German,2016-04-20 00:00:00+00:00,Syrians,"""Es kommen kaum noch Flüchtlinge nach Griechen...",[0.30297568 0.4430802 0.25394407]
3,Unsere 1. Kochshow für #Flüchtlinge. Super spi...,2.480764e+09,2016-04-20 18:25:11+00:00,de,8abc99434d4f5d28,0.0,0.0,4.0,0.0,0.0,...,16.0,4.0,2016.0,2016-16,2016-04,German,2016-04-20 00:00:00+00:00,Syrians,Unsere 1. Kochshow für #Flüchtlinge. Super spi...,[0.02260253 0.07604674 0.9013506 ]
4,500 tote #Flüchtlinge im #Mittelmeer – Tragödi...,6.062653e+08,2016-04-20 16:27:28+00:00,de,e11a8b8e3771f9fa,0.0,1.0,0.0,0.0,0.0,...,16.0,4.0,2016.0,2016-16,2016-04,German,2016-04-20 00:00:00+00:00,Syrians,500 tote #Flüchtlinge im #Mittelmeer – Tragödi...,[0.6982864 0.27545568 0.02625783]
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
66261,"For day 1 of week 2, @AnnaMariaKonsta discusse...",1.104025e+08,2021-06-28 14:43:31+00:00,en,fcbb3c6e0a7eba22,0.0,1.0,2.0,0.0,0.0,...,26.0,6.0,2021.0,2021-26,2021-06,English,2021-06-28 00:00:00+00:00,Ukrainians,"For day 1 of week 2, @user discusses the Socia...",[0.03550766 0.82208896 0.14240335]
66262,"@ariadneconill Europe is racist, but in a diff...",2.521809e+09,2021-06-27 13:03:29+00:00,en,e385d4d639c6a423,0.0,1.0,6.0,0.0,0.0,...,25.0,6.0,2021.0,2021-26,2021-06,English,2021-06-27 00:00:00+00:00,Ukrainians,"@user Europe is racist, but in a different way...",[0.8931986 0.09380516 0.01299612]
66263,"A labour of love, inspired by Middle-earth.\n\...",5.633818e+08,2021-06-27 08:37:21+00:00,en,257640324f249a73,0.0,1.0,17.0,0.0,0.0,...,25.0,6.0,2021.0,2021-26,2021-06,English,2021-06-27 00:00:00+00:00,Ukrainians,"A labour of love, inspired by Middle-earth.\n\...",[0.15655968 0.5201203 0.32332003]
66264,@simongerman600 I must have missed the great f...,2.591892e+09,2021-06-26 08:03:22+00:00,en,000b71538f35fe46,0.0,0.0,1.0,0.0,0.0,...,25.0,6.0,2021.0,2021-25,2021-06,English,2021-06-26 00:00:00+00:00,Ukrainians,@user I must have missed the great flight of t...,[0.91320664 0.06375945 0.02303378]


In [24]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 66266 entries, 0 to 66265
Data columns (total 23 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   text                             66265 non-null  object 
 1   author_id                        66262 non-null  float64
 2   created_at                       38612 non-null  object 
 3   lang                             66262 non-null  object 
 4   geo.place_id                     38226 non-null  object 
 5   public_metrics.retweet_count     38612 non-null  float64
 6   public_metrics.reply_count       38612 non-null  float64
 7   public_metrics.like_count        38612 non-null  float64
 8   public_metrics.quote_count       38612 non-null  float64
 9   public_metrics.impression_count  38612 non-null  float64
 10  geo.coordinates.coordinates      4348 non-null   object 
 11  in_reply_to_user_id              12326 non-null  float64
 12  entities.hashtags 

### 4.3. Converting scores to string

In [25]:
# checking what dtype scores is

df.scores.dtype

dtype('O')

In [26]:
# converting scores to string

df["scores"]=df["scores"].values.astype('str')

### 4.4. Splitting the scores

In [27]:
# viewing the scores

print(df['scores'])

0        [0.0679844  0.88380396 0.0482117 ]
1        [0.3660313  0.5833739  0.05059479]
2        [0.30297568 0.4430802  0.25394407]
3        [0.02260253 0.07604674 0.9013506 ]
4        [0.6982864  0.27545568 0.02625783]
                        ...                
66261    [0.03550766 0.82208896 0.14240335]
66262    [0.8931986  0.09380516 0.01299612]
66263    [0.15655968 0.5201203  0.32332003]
66264    [0.91320664 0.06375945 0.02303378]
66265    [0.92373    0.06414817 0.01212183]
Name: scores, Length: 66266, dtype: object


Upon viewing the scores, it is apparent that there are some seemingly random extra spaces. These spaces need to be removed. Once this is done, the string can be split at each space.

In [28]:
# replacing multiple spaces with a single space

df.scores = df.scores.replace(r'\s+', ' ', regex=True)
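To make the effect of the pattern concrete, here is a small sketch that applies the same regular expression to a single score string using Python's standard re module rather than pandas. The variable names are illustrative only.

```python
import re

raw = '[0.0679844  0.88380396 0.0482117 ]'

# \s+ matches one or more whitespace characters; each run becomes one space
collapsed = re.sub(r'\s+', ' ', raw)

print(collapsed)
```

Note that a run of exactly one space is left as it is, so the single space before the closing bracket survives this step.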

In [29]:
# viewing the scores again

print(df['scores'])

0         [0.0679844 0.88380396 0.0482117 ]
1          [0.3660313 0.5833739 0.05059479]
2         [0.30297568 0.4430802 0.25394407]
3        [0.02260253 0.07604674 0.9013506 ]
4         [0.6982864 0.27545568 0.02625783]
                        ...                
66261    [0.03550766 0.82208896 0.14240335]
66262     [0.8931986 0.09380516 0.01299612]
66263     [0.15655968 0.5201203 0.32332003]
66264    [0.91320664 0.06375945 0.02303378]
66265       [0.92373 0.06414817 0.01212183]
Name: scores, Length: 66266, dtype: object


The extra spaces have been removed. Now, the string can be turned into a list, splitting at each space.

In [30]:
df.scores=df.scores.str[1:-1].str.split(' ').tolist()

In [31]:
# viewing the scores again

print(df['scores'])

0         [0.0679844, 0.88380396, 0.0482117, ]
1           [0.3660313, 0.5833739, 0.05059479]
2          [0.30297568, 0.4430802, 0.25394407]
3        [0.02260253, 0.07604674, 0.9013506, ]
4          [0.6982864, 0.27545568, 0.02625783]
                         ...                  
66261     [0.03550766, 0.82208896, 0.14240335]
66262      [0.8931986, 0.09380516, 0.01299612]
66263      [0.15655968, 0.5201203, 0.32332003]
66264     [0.91320664, 0.06375945, 0.02303378]
66265        [0.92373, 0.06414817, 0.01212183]
Name: scores, Length: 66266, dtype: object


In [32]:
# viewing the scores as a dictionary

df.loc[0:20, "scores"].to_dict()

{0: ['0.0679844', '0.88380396', '0.0482117', ''],
 1: ['0.3660313', '0.5833739', '0.05059479'],
 2: ['0.30297568', '0.4430802', '0.25394407'],
 3: ['0.02260253', '0.07604674', '0.9013506', ''],
 4: ['0.6982864', '0.27545568', '0.02625783'],
 5: ['0.54107517', '0.37931463', '0.07961021'],
 6: ['0.7315472', '0.24781975', '0.02063307'],
 7: ['0.6039582', '0.37551644', '0.02052534'],
 8: ['0.8756516', '0.11219172', '0.01215654'],
 9: ['0.13284624', '0.8060908', '0.06106308'],
 10: ['0.4532086', '0.50662094', '0.0401704', ''],
 11: ['0.7565991', '0.22713704', '0.01626382'],
 12: ['0.7629791', '0.1803103', '0.05671053'],
 13: ['0.03224549', '0.91901845', '0.04873606'],
 14: ['0.716869', '0.26023984', '0.02289111'],
 15: ['0.0262996', '0.13661228', '0.8370881', ''],
 16: ['0.02549446', '0.897955', '0.07655057'],
 17: ['0.2620131', '0.70915294', '0.02883392'],
 18: ['0.01405706', '0.95670676', '0.02923623'],
 19: ['0.27915767', '0.7049082', '0.01593406'],
 20: ['0.76053226', '0.21511602', '0.0

When viewing the scores as a dictionary, some lists appear to have an extra empty item at the end. To solve this, I will create 4 new columns: 1 for each sentiment and 1 for the extra item (i.e., Negative, Neutral, Positive, empty). The empty column will then be checked to verify that it is indeed empty and, if so, dropped.

In [33]:
# separating the list of scores into individual columns and saving it as a new dataframe, df2

df2 = pd.DataFrame(df['scores'].to_list(), columns=['Negative', 'Neutral', 'Positive', 'empty'])

In [34]:
# merging df2 to df

df = pd.concat([df, df2], axis=1)

In [35]:
# viewing the dataframe

df

Unnamed: 0,text,author_id,created_at,lang,geo.place_id,public_metrics.retweet_count,public_metrics.reply_count,public_metrics.like_count,public_metrics.quote_count,public_metrics.impression_count,...,year-month,Language,date,inflow,new text,scores,Negative,Neutral,Positive,empty
0,"Syrisch-orthodoxer Bischof: ""Was im Nahen Oste...",4.122038e+09,2016-04-20 22:55:08+00:00,de,06d9a7c249c59bcd,0.0,0.0,0.0,0.0,0.0,...,2016-04,German,2016-04-20 00:00:00+00:00,Syrians,"Syrisch-orthodoxer Bischof: ""Was im Nahen Oste...","[0.0679844, 0.88380396, 0.0482117, ]",0.0679844,0.88380396,0.0482117,
1,"Habe schon lang nicht gehört, daß Flüchtling G...",1.179544e+09,2016-04-20 21:27:37+00:00,de,e99b714fe65be4fb,0.0,0.0,0.0,0.0,0.0,...,2016-04,German,2016-04-20 00:00:00+00:00,Syrians,"Habe schon lang nicht gehört, daß Flüchtling G...","[0.3660313, 0.5833739, 0.05059479]",0.3660313,0.5833739,0.05059479,
2,"""Es kommen kaum noch Flüchtlinge nach Griechen...",2.246076e+08,2016-04-20 21:18:58+00:00,de,3078869807f9dd36,0.0,0.0,0.0,0.0,0.0,...,2016-04,German,2016-04-20 00:00:00+00:00,Syrians,"""Es kommen kaum noch Flüchtlinge nach Griechen...","[0.30297568, 0.4430802, 0.25394407]",0.30297568,0.4430802,0.25394407,
3,Unsere 1. Kochshow für #Flüchtlinge. Super spi...,2.480764e+09,2016-04-20 18:25:11+00:00,de,8abc99434d4f5d28,0.0,0.0,4.0,0.0,0.0,...,2016-04,German,2016-04-20 00:00:00+00:00,Syrians,Unsere 1. Kochshow für #Flüchtlinge. Super spi...,"[0.02260253, 0.07604674, 0.9013506, ]",0.02260253,0.07604674,0.9013506,
4,500 tote #Flüchtlinge im #Mittelmeer – Tragödi...,6.062653e+08,2016-04-20 16:27:28+00:00,de,e11a8b8e3771f9fa,0.0,1.0,0.0,0.0,0.0,...,2016-04,German,2016-04-20 00:00:00+00:00,Syrians,500 tote #Flüchtlinge im #Mittelmeer – Tragödi...,"[0.6982864, 0.27545568, 0.02625783]",0.6982864,0.27545568,0.02625783,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
66261,"For day 1 of week 2, @AnnaMariaKonsta discusse...",1.104025e+08,2021-06-28 14:43:31+00:00,en,fcbb3c6e0a7eba22,0.0,1.0,2.0,0.0,0.0,...,2021-06,English,2021-06-28 00:00:00+00:00,Ukrainians,"For day 1 of week 2, @user discusses the Socia...","[0.03550766, 0.82208896, 0.14240335]",0.03550766,0.82208896,0.14240335,
66262,"@ariadneconill Europe is racist, but in a diff...",2.521809e+09,2021-06-27 13:03:29+00:00,en,e385d4d639c6a423,0.0,1.0,6.0,0.0,0.0,...,2021-06,English,2021-06-27 00:00:00+00:00,Ukrainians,"@user Europe is racist, but in a different way...","[0.8931986, 0.09380516, 0.01299612]",0.8931986,0.09380516,0.01299612,
66263,"A labour of love, inspired by Middle-earth.\n\...",5.633818e+08,2021-06-27 08:37:21+00:00,en,257640324f249a73,0.0,1.0,17.0,0.0,0.0,...,2021-06,English,2021-06-27 00:00:00+00:00,Ukrainians,"A labour of love, inspired by Middle-earth.\n\...","[0.15655968, 0.5201203, 0.32332003]",0.15655968,0.5201203,0.32332003,
66264,@simongerman600 I must have missed the great f...,2.591892e+09,2021-06-26 08:03:22+00:00,en,000b71538f35fe46,0.0,0.0,1.0,0.0,0.0,...,2021-06,English,2021-06-26 00:00:00+00:00,Ukrainians,@user I must have missed the great flight of t...,"[0.91320664, 0.06375945, 0.02303378]",0.91320664,0.06375945,0.02303378,


In [36]:
# checking the unique values of the empty column

print(df['empty'].unique())

['' None]


The only values in the empty column are None and an empty string (i.e., ''). Therefore, there are no scores in this column and it can be dropped.

In [37]:
df = df.drop(columns=['empty'])
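The empty fourth item comes from the space that some score strings have just before the closing bracket. The sketch below shows this on one string in plain Python. It also shows that calling split with no argument would avoid the empty item altogether. This is offered as a possible alternative, not as the method used in this notebook.

```python
row = '[0.0679844 0.88380396 0.0482117 ]'

inner = row[1:-1]               # strip the brackets, leaving a trailing space
with_empty = inner.split(' ')   # splitting on ' ' keeps a trailing ''
without_empty = inner.split()   # splitting on any whitespace drops it

print(with_empty)
print(without_empty)
```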

In [38]:
# viewing the dataframe

df

Unnamed: 0,text,author_id,created_at,lang,geo.place_id,public_metrics.retweet_count,public_metrics.reply_count,public_metrics.like_count,public_metrics.quote_count,public_metrics.impression_count,...,year-week,year-month,Language,date,inflow,new text,scores,Negative,Neutral,Positive
0,"Syrisch-orthodoxer Bischof: ""Was im Nahen Oste...",4.122038e+09,2016-04-20 22:55:08+00:00,de,06d9a7c249c59bcd,0.0,0.0,0.0,0.0,0.0,...,2016-16,2016-04,German,2016-04-20 00:00:00+00:00,Syrians,"Syrisch-orthodoxer Bischof: ""Was im Nahen Oste...","[0.0679844, 0.88380396, 0.0482117, ]",0.0679844,0.88380396,0.0482117
1,"Habe schon lang nicht gehört, daß Flüchtling G...",1.179544e+09,2016-04-20 21:27:37+00:00,de,e99b714fe65be4fb,0.0,0.0,0.0,0.0,0.0,...,2016-16,2016-04,German,2016-04-20 00:00:00+00:00,Syrians,"Habe schon lang nicht gehört, daß Flüchtling G...","[0.3660313, 0.5833739, 0.05059479]",0.3660313,0.5833739,0.05059479
2,"""Es kommen kaum noch Flüchtlinge nach Griechen...",2.246076e+08,2016-04-20 21:18:58+00:00,de,3078869807f9dd36,0.0,0.0,0.0,0.0,0.0,...,2016-16,2016-04,German,2016-04-20 00:00:00+00:00,Syrians,"""Es kommen kaum noch Flüchtlinge nach Griechen...","[0.30297568, 0.4430802, 0.25394407]",0.30297568,0.4430802,0.25394407
3,Unsere 1. Kochshow für #Flüchtlinge. Super spi...,2.480764e+09,2016-04-20 18:25:11+00:00,de,8abc99434d4f5d28,0.0,0.0,4.0,0.0,0.0,...,2016-16,2016-04,German,2016-04-20 00:00:00+00:00,Syrians,Unsere 1. Kochshow für #Flüchtlinge. Super spi...,"[0.02260253, 0.07604674, 0.9013506, ]",0.02260253,0.07604674,0.9013506
4,500 tote #Flüchtlinge im #Mittelmeer – Tragödi...,6.062653e+08,2016-04-20 16:27:28+00:00,de,e11a8b8e3771f9fa,0.0,1.0,0.0,0.0,0.0,...,2016-16,2016-04,German,2016-04-20 00:00:00+00:00,Syrians,500 tote #Flüchtlinge im #Mittelmeer – Tragödi...,"[0.6982864, 0.27545568, 0.02625783]",0.6982864,0.27545568,0.02625783
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
66261,"For day 1 of week 2, @AnnaMariaKonsta discusse...",1.104025e+08,2021-06-28 14:43:31+00:00,en,fcbb3c6e0a7eba22,0.0,1.0,2.0,0.0,0.0,...,2021-26,2021-06,English,2021-06-28 00:00:00+00:00,Ukrainians,"For day 1 of week 2, @user discusses the Socia...","[0.03550766, 0.82208896, 0.14240335]",0.03550766,0.82208896,0.14240335
66262,"@ariadneconill Europe is racist, but in a diff...",2.521809e+09,2021-06-27 13:03:29+00:00,en,e385d4d639c6a423,0.0,1.0,6.0,0.0,0.0,...,2021-26,2021-06,English,2021-06-27 00:00:00+00:00,Ukrainians,"@user Europe is racist, but in a different way...","[0.8931986, 0.09380516, 0.01299612]",0.8931986,0.09380516,0.01299612
66263,"A labour of love, inspired by Middle-earth.\n\...",5.633818e+08,2021-06-27 08:37:21+00:00,en,257640324f249a73,0.0,1.0,17.0,0.0,0.0,...,2021-26,2021-06,English,2021-06-27 00:00:00+00:00,Ukrainians,"A labour of love, inspired by Middle-earth.\n\...","[0.15655968, 0.5201203, 0.32332003]",0.15655968,0.5201203,0.32332003
66264,@simongerman600 I must have missed the great f...,2.591892e+09,2021-06-26 08:03:22+00:00,en,000b71538f35fe46,0.0,0.0,1.0,0.0,0.0,...,2021-25,2021-06,English,2021-06-26 00:00:00+00:00,Ukrainians,@user I must have missed the great flight of t...,"[0.91320664, 0.06375945, 0.02303378]",0.91320664,0.06375945,0.02303378


## 5. Creating CLASS column

### 5.1. Converting sentiment columns to float

#### 5.1.1. Solving error
Before creating the CLASS column, the sentiment columns (Negative, Neutral, Positive) need to be converted from string to float. However, initial attempts to do so revealed that the Negative column has a couple of strange values. Specifically, two entries contain only the letter 'a'. This is consistent with a missing score: a NaN becomes the string 'nan' in section 4.3, and removing the first and last characters in section 4.4 leaves 'a'. The code below solves this issue and then continues converting the sentiment columns to float.

In [39]:
# locating which entries have the value 'a' for the Negative column

df.loc[df['Negative'] == 'a', 'Negative']

5238    a
5239    a
Name: Negative, dtype: object

Indices 5238 and 5239 have been identified. They will be viewed and dropped.
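A more general way to find values that cannot be converted to float, without knowing in advance what they look like, is sketched below with pd.to_numeric and errors='coerce'. The small series named sample is made up for illustration.

```python
import pandas as pd

# Made-up sample: two valid scores, one stray 'a', one missing value
sample = pd.Series(['0.0679844', 'a', '0.6982864', None], name='Negative')

# Values that cannot be parsed become NaN instead of raising an error
as_numbers = pd.to_numeric(sample, errors='coerce')

# Keep entries that failed to parse but were not missing to begin with
bad = sample[as_numbers.isna() & sample.notna()]

print(bad)
```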

In [40]:
# viewing index 5238

df.iloc[5238].to_dict()

{'text': '❗❗\r#Letzte #Chance! Noch schnell @freifunkmainz beim Online-Voting für #freies #WLAN für #Flüchtlinge unterstützen https://t.co/OhCeAL77jH"',
 'author_id': 3239113792.0,
 'created_at': '2016-01-08 22:52:29+00:00',
 'lang': 'de',
 'geo.place_id': '8abc99434d4f5d28',
 'public_metrics.retweet_count': 0.0,
 'public_metrics.reply_count': 1.0,
 'public_metrics.like_count': 0.0,
 'public_metrics.quote_count': 0.0,
 'public_metrics.impression_count': 0.0,
 'geo.coordinates.coordinates': '[7.00845283, 50.97120189]',
 'in_reply_to_user_id': 176171694.0,
 'entities.hashtags': "[{'start': 3, 'end': 10, 'tag': 'Letzte'}, {'start': 11, 'end': 18, 'tag': 'Chance'}, {'start': 71, 'end': 78, 'tag': 'freies'}, {'start': 79, 'end': 84, 'tag': 'WLAN'}, {'start': 89, 'end': 101, 'tag': 'Flüchtlinge'}]",
 'week': 1.0,
 'month': 1.0,
 'year': 2016.0,
 'year-week': '2016-01',
 'year-month': '2016-01',
 'Language': 'German',
 'date': '2016-01-08 00:00:00+00:00',
 'inflow': 'Syrians',
 'new text': '❗

In [41]:
# viewing index 5239

df.iloc[5239].to_dict()

{'text': '[0.03929918 0.3001709  0.66053   ]',
 'author_id': nan,
 'created_at': nan,
 'lang': nan,
 'geo.place_id': nan,
 'public_metrics.retweet_count': nan,
 'public_metrics.reply_count': nan,
 'public_metrics.like_count': nan,
 'public_metrics.quote_count': nan,
 'public_metrics.impression_count': nan,
 'geo.coordinates.coordinates': nan,
 'in_reply_to_user_id': nan,
 'entities.hashtags': nan,
 'week': nan,
 'month': nan,
 'year': nan,
 'year-week': nan,
 'year-month': nan,
 'Language': nan,
 'date': nan,
 'inflow': nan,
 'new text': nan,
 'scores': ['a'],
 'Negative': 'a',
 'Neutral': None,
 'Positive': None}

Neither entry has usable sentiment scores: both have 'a' in the Negative column, and index 5239 is empty apart from what looks like a score array in its text column. Both need to be dropped.

In [42]:
# dropping index 5238

df = df.drop(df.index[5238])

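**NOTE**: The expression df.index[5238] selects by position, not by label. After the first drop, the row originally at index 5239 has moved up to position 5238, hence the code below uses 5238 again.

In [43]:
# dropping the row originally at index 5239 (now at position 5238)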

df = df.drop(df.index[5238])
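Dropping by position twice works, but it relies on remembering that positions shift after each drop. A minimal alternative sketch, using a small hypothetical dataframe named toy, filters on the value itself so that no positions need to be tracked:

```python
import pandas as pd

# Hypothetical stand-in for df with two bad rows next to each other
toy = pd.DataFrame({'Negative': ['0.1', 'a', 'a', '0.7']})

# Keep only the rows whose Negative value is not the stray 'a'
cleaned = toy[toy['Negative'] != 'a']

print(cleaned.index.tolist())
```

The original index labels are preserved, which matches the behaviour of the two drops above.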

In [44]:
# checking which entries still have the value 'a' for the Negative column

df.loc[df['Negative'] == 'a', 'Negative']

Series([], Name: Negative, dtype: object)

All entries with the value 'a' for the Negative column have been dropped. Conversion of the sentiment columns from string to float can continue now.

#### 5.1.2. Converting Negative to float

In [45]:
df['Negative'] = df['Negative'].astype(float)

#### 5.1.3. Converting Neutral to float

In [46]:
df['Neutral'] = df['Neutral'].astype(float)

#### 5.1.4. Converting Positive to float

In [47]:
df['Positive'] = df['Positive'].astype(float)
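The three conversions can also be written as a single step by selecting all three columns at once. The sketch below uses a hypothetical dataframe named toy with string scores.

```python
import pandas as pd

# Hypothetical stand-in for df, with scores still stored as strings
toy = pd.DataFrame({
    'Negative': ['0.1', '0.7'],
    'Neutral': ['0.8', '0.2'],
    'Positive': ['0.1', '0.1'],
})

sentiment_cols = ['Negative', 'Neutral', 'Positive']

# Convert all three sentiment columns from string to float in one step
toy[sentiment_cols] = toy[sentiment_cols].astype(float)

print(toy.dtypes)
```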

### 5.2. Deriving CLASS column from largest sentiment column

In [48]:
# creating a new empty column, CLASS

df['CLASS'] = ''

In [49]:
# defining a function to derive CLASS based on which sentiment column is largest

def calc_CLASS(Negative, Neutral, Positive):
    # fallback for ties or missing scores, where no column is strictly largest
    CLASS = None
    if Negative > Neutral and Negative > Positive:
        CLASS = 'NEGATIVE'
    elif Neutral > Negative and Neutral > Positive:
        CLASS = 'NEUTRAL'
    elif Positive > Negative and Positive > Neutral:
        CLASS = 'POSITIVE'
    return CLASS

In [50]:
# applying function

df['CLASS'] = df.apply(lambda x: calc_CLASS(x['Negative'], x['Neutral'], x['Positive']), 
                        axis=1)
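As a possible alternative to a row-by-row function, pandas can return the name of the column holding the largest value in each row via idxmax. The sketch below assumes the three sentiment columns are already floats and uses a hypothetical dataframe named toy.

```python
import pandas as pd

# Hypothetical stand-in for df with float sentiment scores
toy = pd.DataFrame({
    'Negative': [0.07, 0.70, 0.02],
    'Neutral': [0.88, 0.28, 0.08],
    'Positive': [0.05, 0.02, 0.90],
})

sentiment_cols = ['Negative', 'Neutral', 'Positive']

# idxmax(axis=1) gives, per row, the label of the column with the largest value
toy['CLASS'] = toy[sentiment_cols].idxmax(axis=1).str.upper()

print(toy['CLASS'].tolist())
```

One difference to keep in mind: when two columns tie for the largest value, idxmax picks the first of them, whereas a function built on strict comparisons assigns no class.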

In [51]:
# viewing dataframe

df

Unnamed: 0,text,author_id,created_at,lang,geo.place_id,public_metrics.retweet_count,public_metrics.reply_count,public_metrics.like_count,public_metrics.quote_count,public_metrics.impression_count,...,year-month,Language,date,inflow,new text,scores,Negative,Neutral,Positive,CLASS
0,"Syrisch-orthodoxer Bischof: ""Was im Nahen Oste...",4.122038e+09,2016-04-20 22:55:08+00:00,de,06d9a7c249c59bcd,0.0,0.0,0.0,0.0,0.0,...,2016-04,German,2016-04-20 00:00:00+00:00,Syrians,"Syrisch-orthodoxer Bischof: ""Was im Nahen Oste...","[0.0679844, 0.88380396, 0.0482117, ]",0.067984,0.883804,0.048212,NEUTRAL
1,"Habe schon lang nicht gehört, daß Flüchtling G...",1.179544e+09,2016-04-20 21:27:37+00:00,de,e99b714fe65be4fb,0.0,0.0,0.0,0.0,0.0,...,2016-04,German,2016-04-20 00:00:00+00:00,Syrians,"Habe schon lang nicht gehört, daß Flüchtling G...","[0.3660313, 0.5833739, 0.05059479]",0.366031,0.583374,0.050595,NEUTRAL
2,"""Es kommen kaum noch Flüchtlinge nach Griechen...",2.246076e+08,2016-04-20 21:18:58+00:00,de,3078869807f9dd36,0.0,0.0,0.0,0.0,0.0,...,2016-04,German,2016-04-20 00:00:00+00:00,Syrians,"""Es kommen kaum noch Flüchtlinge nach Griechen...","[0.30297568, 0.4430802, 0.25394407]",0.302976,0.443080,0.253944,NEUTRAL
3,Unsere 1. Kochshow für #Flüchtlinge. Super spi...,2.480764e+09,2016-04-20 18:25:11+00:00,de,8abc99434d4f5d28,0.0,0.0,4.0,0.0,0.0,...,2016-04,German,2016-04-20 00:00:00+00:00,Syrians,Unsere 1. Kochshow für #Flüchtlinge. Super spi...,"[0.02260253, 0.07604674, 0.9013506, ]",0.022603,0.076047,0.901351,POSITIVE
4,500 tote #Flüchtlinge im #Mittelmeer – Tragödi...,6.062653e+08,2016-04-20 16:27:28+00:00,de,e11a8b8e3771f9fa,0.0,1.0,0.0,0.0,0.0,...,2016-04,German,2016-04-20 00:00:00+00:00,Syrians,500 tote #Flüchtlinge im #Mittelmeer – Tragödi...,"[0.6982864, 0.27545568, 0.02625783]",0.698286,0.275456,0.026258,NEGATIVE
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
66261,"For day 1 of week 2, @AnnaMariaKonsta discusse...",1.104025e+08,2021-06-28 14:43:31+00:00,en,fcbb3c6e0a7eba22,0.0,1.0,2.0,0.0,0.0,...,2021-06,English,2021-06-28 00:00:00+00:00,Ukrainians,"For day 1 of week 2, @user discusses the Socia...","[0.03550766, 0.82208896, 0.14240335]",0.035508,0.822089,0.142403,NEUTRAL
66262,"@ariadneconill Europe is racist, but in a diff...",2.521809e+09,2021-06-27 13:03:29+00:00,en,e385d4d639c6a423,0.0,1.0,6.0,0.0,0.0,...,2021-06,English,2021-06-27 00:00:00+00:00,Ukrainians,"@user Europe is racist, but in a different way...","[0.8931986, 0.09380516, 0.01299612]",0.893199,0.093805,0.012996,NEGATIVE
66263,"A labour of love, inspired by Middle-earth.\n\...",5.633818e+08,2021-06-27 08:37:21+00:00,en,257640324f249a73,0.0,1.0,17.0,0.0,0.0,...,2021-06,English,2021-06-27 00:00:00+00:00,Ukrainians,"A labour of love, inspired by Middle-earth.\n\...","[0.15655968, 0.5201203, 0.32332003]",0.156560,0.520120,0.323320,NEUTRAL
66264,@simongerman600 I must have missed the great f...,2.591892e+09,2021-06-26 08:03:22+00:00,en,000b71538f35fe46,0.0,0.0,1.0,0.0,0.0,...,2021-06,English,2021-06-26 00:00:00+00:00,Ukrainians,@user I must have missed the great flight of t...,"[0.91320664, 0.06375945, 0.02303378]",0.913207,0.063759,0.023034,NEGATIVE


## 6. Creating num_CLASS column

In addition to a categorical class column, CLASS, a numerical class column, num_CLASS, will be created. The value of num_CLASS will be:
* -1 if the tweet is most likely negative
* 0 if the tweet is most likely neutral
* 1 if the tweet is most likely positive
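The mapping in the list above can be sketched as a dictionary lookup on an existing CLASS column. This is an illustrative alternative with a hypothetical dataframe named toy; the dictionary name class_to_num is likewise made up for this example.

```python
import pandas as pd

# Numerical code for each categorical class
class_to_num = {'NEGATIVE': -1, 'NEUTRAL': 0, 'POSITIVE': 1}

# Hypothetical stand-in for df with CLASS already derived
toy = pd.DataFrame({'CLASS': ['NEUTRAL', 'POSITIVE', 'NEGATIVE']})

# Look up each class label in the dictionary
toy['num_CLASS'] = toy['CLASS'].map(class_to_num)

print(toy['num_CLASS'].tolist())
```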

In [52]:
# creating a new empty column, num_CLASS

df['num_CLASS'] = ''

In [53]:
# defining a function to derive num_CLASS based on which sentiment column is largest
# (named calc_num_CLASS so that it does not overwrite calc_CLASS from section 5.2)

def calc_num_CLASS(Negative, Neutral, Positive):
    # fallback for ties or missing scores, where no column is strictly largest
    num_CLASS = np.nan
    if Negative > Neutral and Negative > Positive:
        num_CLASS = -1
    elif Neutral > Negative and Neutral > Positive:
        num_CLASS = 0
    elif Positive > Negative and Positive > Neutral:
        num_CLASS = 1
    return num_CLASS

In [54]:
# applying function

df['num_CLASS'] = df.apply(lambda x: calc_num_CLASS(x['Negative'], x['Neutral'], x['Positive']),
                           axis=1)

In [55]:
# viewing the dataframe

df

Unnamed: 0,text,author_id,created_at,lang,geo.place_id,public_metrics.retweet_count,public_metrics.reply_count,public_metrics.like_count,public_metrics.quote_count,public_metrics.impression_count,...,Language,date,inflow,new text,scores,Negative,Neutral,Positive,CLASS,num_CLASS
0,"Syrisch-orthodoxer Bischof: ""Was im Nahen Oste...",4.122038e+09,2016-04-20 22:55:08+00:00,de,06d9a7c249c59bcd,0.0,0.0,0.0,0.0,0.0,...,German,2016-04-20 00:00:00+00:00,Syrians,"Syrisch-orthodoxer Bischof: ""Was im Nahen Oste...","[0.0679844, 0.88380396, 0.0482117, ]",0.067984,0.883804,0.048212,NEUTRAL,0
1,"Habe schon lang nicht gehört, daß Flüchtling G...",1.179544e+09,2016-04-20 21:27:37+00:00,de,e99b714fe65be4fb,0.0,0.0,0.0,0.0,0.0,...,German,2016-04-20 00:00:00+00:00,Syrians,"Habe schon lang nicht gehört, daß Flüchtling G...","[0.3660313, 0.5833739, 0.05059479]",0.366031,0.583374,0.050595,NEUTRAL,0
2,"""Es kommen kaum noch Flüchtlinge nach Griechen...",2.246076e+08,2016-04-20 21:18:58+00:00,de,3078869807f9dd36,0.0,0.0,0.0,0.0,0.0,...,German,2016-04-20 00:00:00+00:00,Syrians,"""Es kommen kaum noch Flüchtlinge nach Griechen...","[0.30297568, 0.4430802, 0.25394407]",0.302976,0.443080,0.253944,NEUTRAL,0
3,Unsere 1. Kochshow für #Flüchtlinge. Super spi...,2.480764e+09,2016-04-20 18:25:11+00:00,de,8abc99434d4f5d28,0.0,0.0,4.0,0.0,0.0,...,German,2016-04-20 00:00:00+00:00,Syrians,Unsere 1. Kochshow für #Flüchtlinge. Super spi...,"[0.02260253, 0.07604674, 0.9013506, ]",0.022603,0.076047,0.901351,POSITIVE,1
4,500 tote #Flüchtlinge im #Mittelmeer – Tragödi...,6.062653e+08,2016-04-20 16:27:28+00:00,de,e11a8b8e3771f9fa,0.0,1.0,0.0,0.0,0.0,...,German,2016-04-20 00:00:00+00:00,Syrians,500 tote #Flüchtlinge im #Mittelmeer – Tragödi...,"[0.6982864, 0.27545568, 0.02625783]",0.698286,0.275456,0.026258,NEGATIVE,-1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
66261,"For day 1 of week 2, @AnnaMariaKonsta discusse...",1.104025e+08,2021-06-28 14:43:31+00:00,en,fcbb3c6e0a7eba22,0.0,1.0,2.0,0.0,0.0,...,English,2021-06-28 00:00:00+00:00,Ukrainians,"For day 1 of week 2, @user discusses the Socia...","[0.03550766, 0.82208896, 0.14240335]",0.035508,0.822089,0.142403,NEUTRAL,0
66262,"@ariadneconill Europe is racist, but in a diff...",2.521809e+09,2021-06-27 13:03:29+00:00,en,e385d4d639c6a423,0.0,1.0,6.0,0.0,0.0,...,English,2021-06-27 00:00:00+00:00,Ukrainians,"@user Europe is racist, but in a different way...","[0.8931986, 0.09380516, 0.01299612]",0.893199,0.093805,0.012996,NEGATIVE,-1
66263,"A labour of love, inspired by Middle-earth.\n\...",5.633818e+08,2021-06-27 08:37:21+00:00,en,257640324f249a73,0.0,1.0,17.0,0.0,0.0,...,English,2021-06-27 00:00:00+00:00,Ukrainians,"A labour of love, inspired by Middle-earth.\n\...","[0.15655968, 0.5201203, 0.32332003]",0.156560,0.520120,0.323320,NEUTRAL,0
66264,@simongerman600 I must have missed the great f...,2.591892e+09,2021-06-26 08:03:22+00:00,en,000b71538f35fe46,0.0,0.0,1.0,0.0,0.0,...,English,2021-06-26 00:00:00+00:00,Ukrainians,@user I must have missed the great flight of t...,"[0.91320664, 0.06375945, 0.02303378]",0.913207,0.063759,0.023034,NEGATIVE,-1


## 7. Handling nulls

In [56]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 66264 entries, 0 to 66265
Data columns (total 28 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   text                             66263 non-null  object 
 1   author_id                        66261 non-null  float64
 2   created_at                       38611 non-null  object 
 3   lang                             66261 non-null  object 
 4   geo.place_id                     38225 non-null  object 
 5   public_metrics.retweet_count     38611 non-null  float64
 6   public_metrics.reply_count       38611 non-null  float64
 7   public_metrics.like_count        38611 non-null  float64
 8   public_metrics.quote_count       38611 non-null  float64
 9   public_metrics.impression_count  38611 non-null  float64
 10  geo.coordinates.coordinates      4347 non-null   object 
 11  in_reply_to_user_id              12325 non-null  float64
 12  entities.hashtags 

In [57]:
print(66264-38611)

27653


There are 27,653 rows with a missing entry for "created_at".
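The same count can also be read straight from the data instead of subtracting the totals reported by `df.info()`. The following is a minimal sketch on a small stand-in dataframe; `toy_df` and its values are invented for illustration and are not the thesis data.

```python
import pandas as pd

# hypothetical stand-in for df; the values are invented for illustration
toy_df = pd.DataFrame({
    "text": ["a", "b", None, "d"],
    "created_at": ["2016-04-20", None, None, "2021-06-28"],
})

# number of missing entries in a single column
n_missing_created_at = toy_df["created_at"].isna().sum()

# number of missing entries in every column at once
missing_per_column = toy_df.isna().sum()

print(n_missing_created_at)
print(missing_per_column)
```

Applied to the real dataframe, `df['created_at'].isna().sum()` should agree with the subtraction above.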

In [58]:
# use isnull() function to check for null values in the 'created_at' column

null_rows = df[df['created_at'].isnull()]

# print the rows with null values in the 'created_at' column

print(null_rows)

                                                    text   author_id  \
8642                                               @N24          NaN   
8643                                                 NaN         NaN   
13732  #refugeeswelcome #1209HH #Flüchtling #Syrien #...         NaN   
29750  Die Belarus-Route ist nach wie vor offen via @...  19836074.0   
29751  Die Belarus-Route ist nach wie vor offen via @...  19836074.0   
...                                                  ...         ...   
57395  Die Belarus-Route ist nach wie vor offen via @...  19836074.0   
57396  Die Belarus-Route ist nach wie vor offen via @...  19836074.0   
57397  Die Belarus-Route ist nach wie vor offen via @...  19836074.0   
57398  Die Belarus-Route ist nach wie vor offen via @...  19836074.0   
57399  Die Belarus-Route ist nach wie vor offen via @...  19836074.0   

      created_at lang geo.place_id  public_metrics.retweet_count  \
8642         NaN  NaN          NaN                           NaN   

In [59]:
# create a new dataframe containing only the rows with a null value in the 'created_at' column

null_rows_df = df[df['created_at'].isnull()]

In [60]:
null_rows_df

Unnamed: 0,text,author_id,created_at,lang,geo.place_id,public_metrics.retweet_count,public_metrics.reply_count,public_metrics.like_count,public_metrics.quote_count,public_metrics.impression_count,...,Language,date,inflow,new text,scores,Negative,Neutral,Positive,CLASS,num_CLASS
8642,@N24,,,,,,,,,,...,German,,Syrians,@user,"[0.23433569, 0.4841369, 0.2815274, ]",0.234336,0.484137,0.281527,NEUTRAL,0
8643,,,,,,,,,,,...,German,,Syrians,,"[0.34819144, 0.39594722, 0.2558613, ]",0.348191,0.395947,0.255861,NEUTRAL,0
13732,#refugeeswelcome #1209HH #Flüchtling #Syrien #...,,,,,,,,,,...,German,,Syrians,#refugeeswelcome #1209HH #Flüchtling #Syrien #...,"[0.11344611, 0.7992048, 0.08734913]",0.113446,0.799205,0.087349,NEUTRAL,0
29750,Die Belarus-Route ist nach wie vor offen via @...,19836074.0,,de,,,,,,,...,German,,Ukrainians,Die Belarus-Route ist nach wie vor offen via @...,"[0.0282606, 0.7912057, 0.18053377]",0.028261,0.791206,0.180534,NEUTRAL,0
29751,Die Belarus-Route ist nach wie vor offen via @...,19836074.0,,de,,,,,,,...,German,,Ukrainians,Die Belarus-Route ist nach wie vor offen via @...,"[0.0282606, 0.7912057, 0.18053377]",0.028261,0.791206,0.180534,NEUTRAL,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
57395,Die Belarus-Route ist nach wie vor offen via @...,19836074.0,,de,,,,,,,...,German,,Ukrainians,Die Belarus-Route ist nach wie vor offen via @...,"[0.0282606, 0.7912057, 0.18053377]",0.028261,0.791206,0.180534,NEUTRAL,0
57396,Die Belarus-Route ist nach wie vor offen via @...,19836074.0,,de,,,,,,,...,German,,Ukrainians,Die Belarus-Route ist nach wie vor offen via @...,"[0.0282606, 0.7912057, 0.18053377]",0.028261,0.791206,0.180534,NEUTRAL,0
57397,Die Belarus-Route ist nach wie vor offen via @...,19836074.0,,de,,,,,,,...,German,,Ukrainians,Die Belarus-Route ist nach wie vor offen via @...,"[0.0282606, 0.7912057, 0.18053377]",0.028261,0.791206,0.180534,NEUTRAL,0
57398,Die Belarus-Route ist nach wie vor offen via @...,19836074.0,,de,,,,,,,...,German,,Ukrainians,Die Belarus-Route ist nach wie vor offen via @...,"[0.0282606, 0.7912057, 0.18053377]",0.028261,0.791206,0.180534,NEUTRAL,0


In [61]:
null_rows_df = null_rows_df.reset_index(drop=True)

In [62]:
null_rows_df

Unnamed: 0,text,author_id,created_at,lang,geo.place_id,public_metrics.retweet_count,public_metrics.reply_count,public_metrics.like_count,public_metrics.quote_count,public_metrics.impression_count,...,Language,date,inflow,new text,scores,Negative,Neutral,Positive,CLASS,num_CLASS
0,@N24,,,,,,,,,,...,German,,Syrians,@user,"[0.23433569, 0.4841369, 0.2815274, ]",0.234336,0.484137,0.281527,NEUTRAL,0
1,,,,,,,,,,,...,German,,Syrians,,"[0.34819144, 0.39594722, 0.2558613, ]",0.348191,0.395947,0.255861,NEUTRAL,0
2,#refugeeswelcome #1209HH #Flüchtling #Syrien #...,,,,,,,,,,...,German,,Syrians,#refugeeswelcome #1209HH #Flüchtling #Syrien #...,"[0.11344611, 0.7992048, 0.08734913]",0.113446,0.799205,0.087349,NEUTRAL,0
3,Die Belarus-Route ist nach wie vor offen via @...,19836074.0,,de,,,,,,,...,German,,Ukrainians,Die Belarus-Route ist nach wie vor offen via @...,"[0.0282606, 0.7912057, 0.18053377]",0.028261,0.791206,0.180534,NEUTRAL,0
4,Die Belarus-Route ist nach wie vor offen via @...,19836074.0,,de,,,,,,,...,German,,Ukrainians,Die Belarus-Route ist nach wie vor offen via @...,"[0.0282606, 0.7912057, 0.18053377]",0.028261,0.791206,0.180534,NEUTRAL,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
27648,Die Belarus-Route ist nach wie vor offen via @...,19836074.0,,de,,,,,,,...,German,,Ukrainians,Die Belarus-Route ist nach wie vor offen via @...,"[0.0282606, 0.7912057, 0.18053377]",0.028261,0.791206,0.180534,NEUTRAL,0
27649,Die Belarus-Route ist nach wie vor offen via @...,19836074.0,,de,,,,,,,...,German,,Ukrainians,Die Belarus-Route ist nach wie vor offen via @...,"[0.0282606, 0.7912057, 0.18053377]",0.028261,0.791206,0.180534,NEUTRAL,0
27650,Die Belarus-Route ist nach wie vor offen via @...,19836074.0,,de,,,,,,,...,German,,Ukrainians,Die Belarus-Route ist nach wie vor offen via @...,"[0.0282606, 0.7912057, 0.18053377]",0.028261,0.791206,0.180534,NEUTRAL,0
27651,Die Belarus-Route ist nach wie vor offen via @...,19836074.0,,de,,,,,,,...,German,,Ukrainians,Die Belarus-Route ist nach wie vor offen via @...,"[0.0282606, 0.7912057, 0.18053377]",0.028261,0.791206,0.180534,NEUTRAL,0


There is at least one row (index 1) that contains no text. I will now check if there are others.

In [63]:
# use isnull() function to check for null values in the 'text' column
null_text = null_rows_df[null_rows_df['text'].isnull()]

# print the rows with null values in the 'text' column
print(null_text)

  text  author_id created_at lang geo.place_id  public_metrics.retweet_count  \
1  NaN        NaN        NaN  NaN          NaN                           NaN   

   public_metrics.reply_count  public_metrics.like_count  \
1                         NaN                        NaN   

   public_metrics.quote_count  public_metrics.impression_count  ... Language  \
1                         NaN                              NaN  ...   German   

   date   inflow  new text                                 scores  Negative  \
1   NaN  Syrians       NaN  [0.34819144, 0.39594722, 0.2558613, ]  0.348191   

    Neutral  Positive    CLASS num_CLASS  
1  0.395947  0.255861  NEUTRAL         0  

[1 rows x 28 columns]


In [65]:
# view the "text" entry for the row with index 0

print(null_rows_df.loc[0, 'text'])

@N24 


Index 0 contains essentially no text either.

In [66]:
# view the "text" entry for the row with index 2

print(null_rows_df.loc[2, 'text'])

#refugeeswelcome #1209HH #Flüchtling #Syrien #Wilhelmsburg 


In [67]:
print(null_rows_df.iloc[2])

text                               #refugeeswelcome #1209HH #Flüchtling #Syrien #...
author_id                                                                        NaN
created_at                                                                       NaN
lang                                                                             NaN
geo.place_id                                                                     NaN
public_metrics.retweet_count                                                     NaN
public_metrics.reply_count                                                       NaN
public_metrics.like_count                                                        NaN
public_metrics.quote_count                                                       NaN
public_metrics.impression_count                                                  NaN
geo.coordinates.coordinates                                                      NaN
in_reply_to_user_id                                              

In [68]:
# view the "text" entry for the row with index 3

print(null_rows_df.loc[3, 'text'])

Die Belarus-Route ist nach wie vor offen via @RND_de


In [69]:
print(null_rows_df.iloc[3])

text                               Die Belarus-Route ist nach wie vor offen via @...
author_id                                                                 19836074.0
created_at                                                                       NaN
lang                                                                              de
geo.place_id                                                                     NaN
public_metrics.retweet_count                                                     NaN
public_metrics.reply_count                                                       NaN
public_metrics.like_count                                                        NaN
public_metrics.quote_count                                                       NaN
public_metrics.impression_count                                                  NaN
geo.coordinates.coordinates                                                      NaN
in_reply_to_user_id                                              

The tweet at index 3 appears to have been duplicated many times by mistake, and the copies are missing most other information (such as created_at).

In [70]:
# fill NaN values within 'text' with an empty string

null_rows_df['text'] = null_rows_df['text'].fillna('')

# count the number of rows that contain the specified text

count = len(null_rows_df[null_rows_df['text'].str.contains('Die Belarus-Route ist nach wie vor offen via @RND_de')])

# print the count

print("Number of rows that contain 'Die Belarus-Route ist nach wie vor offen via @RND_de':", count)

Number of rows that contain 'Die Belarus-Route ist nach wie vor offen via @RND_de': 27650


In [71]:
print(27653-27650)

3


Of the 27,653 entries with a missing value for created_at, 27,650 contain the text "Die Belarus-Route ist nach wie vor offen via @RND_de".
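A repeated tweet like this one can also be spotted without knowing its text in advance, by tabulating the 'text' values among the rows that lack created_at. This is only a sketch of that idea on invented data; `toy_df`, `toy_null_rows`, and `text_counts` are hypothetical names.

```python
import pandas as pd

# hypothetical stand-in for df; the values are invented for illustration
toy_df = pd.DataFrame({
    "text": ["same tweet", "same tweet", "same tweet", "@N24", None, "dated tweet"],
    "created_at": [None, None, None, None, None, "2021-06-28"],
})

# keep only the rows that are missing created_at
toy_null_rows = toy_df[toy_df["created_at"].isna()]

# count how often each text occurs; dropna=False also counts missing text
text_counts = toy_null_rows["text"].value_counts(dropna=False)

print(text_counts)
```

The most frequent text appears first, so a heavily duplicated tweet stands out at the top of the result.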

I want to now see whether this tweet exists within the original dataframe but has an actual value for created_at. To do so, I will create a dataframe without any nulls for created_at and check if any row contains "Die Belarus-Route ist nach wie vor offen via @RND_de" under the column 'text'.

In [72]:
# create a new dataframe without null values in the 'created_at' column

no_nulls_df = df.dropna(subset=['created_at'], how='any')

In [73]:
no_nulls_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 38611 entries, 0 to 66265
Data columns (total 28 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   text                             38611 non-null  object 
 1   author_id                        38611 non-null  float64
 2   created_at                       38611 non-null  object 
 3   lang                             38611 non-null  object 
 4   geo.place_id                     38225 non-null  object 
 5   public_metrics.retweet_count     38611 non-null  float64
 6   public_metrics.reply_count       38611 non-null  float64
 7   public_metrics.like_count        38611 non-null  float64
 8   public_metrics.quote_count       38611 non-null  float64
 9   public_metrics.impression_count  38611 non-null  float64
 10  geo.coordinates.coordinates      4347 non-null   object 
 11  in_reply_to_user_id              12325 non-null  float64
 12  entities.hashtags 

In [74]:
# count the number of rows that contain the specified text

count = len(no_nulls_df[no_nulls_df['text'].str.contains('Die Belarus-Route ist nach wie vor offen via @RND_de')])

# print the count

print("Number of rows that contain 'Die Belarus-Route ist nach wie vor offen via @RND_de':", count)

Number of rows that contain 'Die Belarus-Route ist nach wie vor offen via @RND_de': 0


Zero rows contain that tweet.
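The two counts above (with and without a value for created_at) could also be obtained in one pass, without building separate dataframes or filling nulls first. The sketch below assumes toy data and a hypothetical `target` string; `na=False` treats missing text as a non-match and `regex=False` matches the string literally.

```python
import pandas as pd

# hypothetical stand-in for df; the values are invented for illustration
toy_df = pd.DataFrame({
    "text": ["repeated tweet via @X_y", None, "another tweet", "repeated tweet via @X_y"],
    "created_at": [None, None, "2021-06-28", None],
})

target = "repeated tweet via @X_y"  # hypothetical text to search for

# boolean mask of the rows whose text contains the target string
has_target = toy_df["text"].str.contains(target, na=False, regex=False)

# split the matches by whether created_at is present
n_undated = (has_target & toy_df["created_at"].isna()).sum()
n_dated = (has_target & toy_df["created_at"].notna()).sum()

print(n_undated, n_dated)
```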

I want to see what the other 3 tweets were.

In [75]:
# remove all observations whose 'text' entry contains 'Die Belarus-Route ist nach wie vor offen via @RND_de'

null_rows_df = null_rows_df[~null_rows_df['text'].str.contains('Die Belarus-Route ist nach wie vor offen via @RND_de')]

In [76]:
null_rows_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3 entries, 0 to 2
Data columns (total 28 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   text                             3 non-null      object 
 1   author_id                        0 non-null      float64
 2   created_at                       0 non-null      object 
 3   lang                             0 non-null      object 
 4   geo.place_id                     0 non-null      object 
 5   public_metrics.retweet_count     0 non-null      float64
 6   public_metrics.reply_count       0 non-null      float64
 7   public_metrics.like_count        0 non-null      float64
 8   public_metrics.quote_count       0 non-null      float64
 9   public_metrics.impression_count  0 non-null      float64
 10  geo.coordinates.coordinates      0 non-null      object 
 11  in_reply_to_user_id              0 non-null      float64
 12  entities.hashtags         

In [77]:
null_rows_df

Unnamed: 0,text,author_id,created_at,lang,geo.place_id,public_metrics.retweet_count,public_metrics.reply_count,public_metrics.like_count,public_metrics.quote_count,public_metrics.impression_count,...,Language,date,inflow,new text,scores,Negative,Neutral,Positive,CLASS,num_CLASS
0,@N24,,,,,,,,,,...,German,,Syrians,@user,"[0.23433569, 0.4841369, 0.2815274, ]",0.234336,0.484137,0.281527,NEUTRAL,0
1,,,,,,,,,,,...,German,,Syrians,,"[0.34819144, 0.39594722, 0.2558613, ]",0.348191,0.395947,0.255861,NEUTRAL,0
2,#refugeeswelcome #1209HH #Flüchtling #Syrien #...,,,,,,,,,,...,German,,Syrians,#refugeeswelcome #1209HH #Flüchtling #Syrien #...,"[0.11344611, 0.7992048, 0.08734913]",0.113446,0.799205,0.087349,NEUTRAL,0


In [78]:
# view the "text" entry for the row with index 1

print(null_rows_df.loc[1, 'text'])




The remaining tweets contain either no text, only "@N24", or only a series of hashtags. They also lack any other important information, such as created_at.
Therefore, EVERY observation that has a null for created_at needs to be dropped from the original dataframe.

In [79]:
# dropping all observations that have null for created_at from original dataframe

df = df.dropna(subset=['created_at'])

In [80]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 38611 entries, 0 to 66265
Data columns (total 28 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   text                             38611 non-null  object 
 1   author_id                        38611 non-null  float64
 2   created_at                       38611 non-null  object 
 3   lang                             38611 non-null  object 
 4   geo.place_id                     38225 non-null  object 
 5   public_metrics.retweet_count     38611 non-null  float64
 6   public_metrics.reply_count       38611 non-null  float64
 7   public_metrics.like_count        38611 non-null  float64
 8   public_metrics.quote_count       38611 non-null  float64
 9   public_metrics.impression_count  38611 non-null  float64
 10  geo.coordinates.coordinates      4347 non-null   object 
 11  in_reply_to_user_id              12325 non-null  float64
 12  entities.hashtags 

38,611 observations remain.

To summarize, 27,653 observations without any useful information were dropped, leaving 38,611 in the dataset.
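The bookkeeping in this summary (rows dropped plus rows kept equals the original row count) can be checked in code rather than by eye. A minimal sketch, assuming an invented `toy_df` in place of the real data:

```python
import pandas as pd

# hypothetical stand-in for df; the values are invented for illustration
toy_df = pd.DataFrame({
    "text": ["a", "b", "c", "d", "e"],
    "created_at": ["2016-04-20", None, "2021-06-27", None, None],
})

n_before = len(toy_df)
n_missing = toy_df["created_at"].isna().sum()

# same operation as in the notebook, applied to the toy data
toy_clean = toy_df.dropna(subset=["created_at"])

# rows dropped plus rows kept should equal the original row count
assert len(toy_clean) + n_missing == n_before

# no nulls should remain in created_at
assert toy_clean["created_at"].notna().all()

print(n_before, n_missing, len(toy_clean))
```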

## 8. Saving the data
The data will be saved in the CASS_thesis folder as a csv titled,

> *04_Prepared-data_limited_merged.csv*

In [81]:
df.to_csv(CASS_thesis / '04_Prepared-data_limited_merged.csv')
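Note that `to_csv` writes the dataframe index as the first column by default, and after `dropna()` that index is no longer contiguous. When the file is read back, passing `index_col=0` to `pd.read_csv` restores it as the index; otherwise pandas loads it as a column named "Unnamed: 0". The round trip below is a sketch on invented data, written to a temporary folder rather than to CASS_thesis; the file name is hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

# hypothetical stand-in for df, with a non-contiguous index as left by dropna()
toy_df = pd.DataFrame(
    {"text": ["a", "b"], "num_CLASS": [0, -1]},
    index=[0, 5],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "toy_prepared.csv"  # hypothetical file name

    # the index is written as the first, unnamed column
    toy_df.to_csv(out_path)

    # index_col=0 turns that first column back into the index
    reloaded = pd.read_csv(out_path, index_col=0)

print(reloaded.shape == toy_df.shape)
print(list(reloaded.index))
```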