# Data Preparation
**Author**: Andrea Cass

## 1. About this notebook
The purpose of this notebook is to make some final preparations before the data can be explored and visualized. The data used is:
> *03_Sentiment-analysis_merged.csv*

Goals:
* Separate the set of probability scores into individual columns (Negative, Neutral, Positive)
* Create a new categorical column, CLASS, based on which sentiment classification has the largest probability score
* Create a new numerical column, num_CLASS, based on class (-1 for NEGATIVE, 0 for NEUTRAL, 1 for POSITIVE)

The output will be a single dataset saved as a csv file titled,
> *04_Prepared-data_merged.csv*

## 2. Imports

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import csv
import os
from pathlib import Path

## 3. Working directory & file paths

Before beginning data preparation, the working directory needs to be set up. You should have already created a folder called "CASS_thesis" within your desired working directory.

Two objects will be named:

* **cwd**: the current working directory (e.g., your Desktop)
* **CASS_thesis**: the folder where all data from my Notebooks will be saved

### 3.1. Current working directory
Use the code below to find out what your current working directory is set to.

In [None]:
# find current working directory

os.getcwd()

If your current working directory is not your desired directory, follow the subsequent steps to change the working directory by:
1. deciding where you would like your working directory to be (e.g., your Desktop)
2. entering the file path of your desired working directory into the code below

**NOTE**: If you are satisfied with your working directory and do NOT wish to change it, skip the block of code underneath **3.1.1. Changing current working directory** and, instead, proceed from the block of code underneath **3.1.2. Naming current working directory**.

#### 3.1.1. Changing current working directory
**NOTE**: The code below contains the path to **my** desired working directory to serve as an example. You must alter it to the path of **your** desired working directory. Keep in mind that my example follows macOS path conventions, and Windows paths are formatted differently.

In [None]:
# changing current working directory

os.chdir('/Users/andycass/Desktop')
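
Since the example path above only exists on one particular Mac, a platform-independent variant may be easier to adapt. The sketch below assumes that the desired working directory is the Desktop folder inside the user's home directory; `desired_dir` is a hypothetical name that is not used elsewhere in this notebook.

```python
import os
from pathlib import Path

# Assumption: the desired working directory is <home>/Desktop.
# Path.home() resolves the home folder on macOS, Linux, and Windows.
desired_dir = Path.home() / "Desktop"

# Only change directory if that folder actually exists on this machine.
if desired_dir.is_dir():
    os.chdir(desired_dir)

print(Path.cwd())
```

Building the path from `Path.home()` avoids hard-coding a user name or a platform-specific path separator.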

#### 3.1.2. Naming current working directory
Now that your current working directory is established, use the code below to name it "cwd":

In [None]:
# naming the current working directory

cwd = Path.cwd()

In [None]:
# double-checking the current working directory location

cwd

### 3.2. CASS_thesis

In [None]:
# naming the CASS_thesis folder

CASS_thesis = cwd / 'CASS_thesis'

In [None]:
# double-checking the CASS_thesis location

CASS_thesis
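
Because every later step reads from and writes to this folder, it can help to confirm that the folder and the input file exist before loading anything. The following is a minimal sketch under the assumption that the folder and file names match those used in this notebook; `input_file` and `missing` are hypothetical helper names.

```python
from pathlib import Path

cwd = Path.cwd()
CASS_thesis = cwd / "CASS_thesis"
input_file = CASS_thesis / "03_Sentiment-analysis_merged.csv"

# Collect whichever of the two expected paths is absent.
missing = [str(p) for p in (CASS_thesis, input_file) if not p.exists()]

if missing:
    print("Not found:", missing)
else:
    print("Folder and input file are in place.")
```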

## 4. Separating scores
### 4.1. Loading the data

In [2]:
df = pd.read_csv(CASS_thesis / '03_Sentiment-analysis_merged.csv')



### 4.2. Viewing the dataframe

In [3]:
df

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,lang,created_at,author_id,in_reply_to_user_id,text,geo.place_id,entities.hashtags,public_metrics.retweet_count,...,week,month,year,year-week,year-month,Language,date,inflow,new text,scores
0,0,0,de,2016-04-20 23:04:40+00:00,14526045,41482148,"@FrauWeh Film gesehen und nur gestaunt. Wir, a...",e11a8b8e3771f9fa,"[{'start': 126, 'end': 131, 'tag': 'OMFG'}]",0,...,16.0,4.0,2016.0,2016-16,2016-04,German,,Syrians,"@user Film gesehen und nur gestaunt. Wir, aus ...",[0.63110495 0.18820627 0.18068886]
1,1,1,de,2016-04-20 22:55:08+00:00,4122038069,,"Syrisch-orthodoxer Bischof: ""Was im Nahen Oste...",06d9a7c249c59bcd,,0,...,16.0,4.0,2016.0,2016-16,2016-04,German,,Syrians,"Syrisch-orthodoxer Bischof: ""Was im Nahen Oste...",[0.0679844 0.88380396 0.0482117 ]
2,2,2,de,2016-04-20 21:27:37+00:00,1179543852,,"Habe schon lang nicht gehört, daß Flüchtling G...",e99b714fe65be4fb,,0,...,16.0,4.0,2016.0,2016-16,2016-04,German,,Syrians,"Habe schon lang nicht gehört, daß Flüchtling G...",[0.3660313 0.5833739 0.05059479]
3,3,3,de,2016-04-20 21:18:58+00:00,224607633,,"""Es kommen kaum noch Flüchtlinge nach Griechen...",3078869807f9dd36,,0,...,16.0,4.0,2016.0,2016-16,2016-04,German,,Syrians,"""Es kommen kaum noch Flüchtlinge nach Griechen...",[0.30297568 0.4430802 0.25394407]
4,4,4,de,2016-04-20 20:56:48+00:00,3022904603,,"Verständlich, aber #Frankreich muss eigene Feh...",48504653e183c91c,"[{'start': 19, 'end': 30, 'tag': 'Frankreich'}...",0,...,16.0,4.0,2016.0,2016-16,2016-04,German,,Syrians,"Verständlich, aber #Frankreich muss eigene Feh...",[0.78610265 0.18985648 0.0240408 ]
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
68317,68316,68316,en,2021-06-25 10:34:05+00:00,232958476,,"The ministry of immigration , runs the biggest...",37439688c6302728,,0.0,...,25.0,6.0,2021.0,2021-25,2021-06,English,2021-06-25 00:00:00+00:00,Ukrainians,"The ministry of immigration , runs the biggest...",[0.75754243 0.21522887 0.02722873]
68318,68317,68317,en,2021-06-24 19:29:39+00:00,9474872,9474872.0,"@sudo_f @typo3 @felicity_brand Intellectually,...",8abc99434d4f5d28,,0.0,...,25.0,6.0,2021.0,2021-25,2021-06,English,2021-06-24 00:00:00+00:00,Ukrainians,"@user @user @user Intellectually, it would be ...",[0.05359944 0.20743202 0.7389685 ]
68319,68318,68318,en,2021-06-24 18:33:38+00:00,980714168,2199678761.0,@Waringphilip Agree. Immigration has done me p...,c82d9e53ae03d753,,0.0,...,25.0,6.0,2021.0,2021-25,2021-06,English,2021-06-24 00:00:00+00:00,Ukrainians,"@user Agree. Immigration has done me proud, too.",[0.01809264 0.04827376 0.9336336 ]
68320,68319,68319,en,2021-06-24 11:16:36+00:00,185889479,10809412.0,@rakyll I would love to have automatic cross z...,5bcd72da50f0ee77,,0.0,...,25.0,6.0,2021.0,2021-25,2021-06,English,2021-06-24 00:00:00+00:00,Ukrainians,@user I would love to have automatic cross zon...,[0.68358207 0.23378702 0.08263103]


### 4.3. Converting scores to string

In [4]:
# checking what dtype scores is

df.scores.dtype

dtype('O')

In [5]:
# converting scores to string

df["scores"]=df["scores"].values.astype('str')

### 4.4. Splitting the scores

In [6]:
# viewing the scores

print(df['scores'])

0        [0.63110495 0.18820627 0.18068886]
1        [0.0679844  0.88380396 0.0482117 ]
2        [0.3660313  0.5833739  0.05059479]
3        [0.30297568 0.4430802  0.25394407]
4        [0.78610265 0.18985648 0.0240408 ]
                        ...                
68317    [0.75754243 0.21522887 0.02722873]
68318    [0.05359944 0.20743202 0.7389685 ]
68319    [0.01809264 0.04827376 0.9336336 ]
68320    [0.68358207 0.23378702 0.08263103]
68321    [0.7000523  0.2591748  0.04077292]
Name: scores, Length: 68322, dtype: object


Upon viewing the scores, it is apparent that some values contain extra, seemingly random spaces. These need to be removed so that the string can then be split at each space.

In [7]:
# replacing multiple spaces with a single space

df.scores = df.scores.replace(r'\s+', ' ', regex=True)
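
To see what this replacement does to a single value, the same pattern can be applied with Python's `re` module. The sketch below uses one sample string taken from the output above; `raw` and `collapsed` are illustrative names.

```python
import re

# One score string as it appears in the data, with a double space inside
# and a single space before the closing bracket.
raw = "[0.0679844  0.88380396 0.0482117 ]"

# Replace every run of whitespace with exactly one space.
collapsed = re.sub(r"\s+", " ", raw)
print(collapsed)
```

Note that the pattern only shortens runs of whitespace. A single space directly before the closing bracket is left in place.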

In [8]:
# viewing the scores again

print(df['scores'])

0        [0.63110495 0.18820627 0.18068886]
1         [0.0679844 0.88380396 0.0482117 ]
2          [0.3660313 0.5833739 0.05059479]
3         [0.30297568 0.4430802 0.25394407]
4        [0.78610265 0.18985648 0.0240408 ]
                        ...                
68317    [0.75754243 0.21522887 0.02722873]
68318    [0.05359944 0.20743202 0.7389685 ]
68319    [0.01809264 0.04827376 0.9336336 ]
68320    [0.68358207 0.23378702 0.08263103]
68321      [0.7000523 0.2591748 0.04077292]
Name: scores, Length: 68322, dtype: object


Runs of several spaces have been collapsed into single spaces. Now, the string can be turned into a list by stripping the surrounding brackets and splitting at each space.

In [9]:
df.scores=df.scores.str[1:-1].str.split(' ').tolist()

In [10]:
# viewing the scores again

print(df['scores'])

0         [0.63110495, 0.18820627, 0.18068886]
1         [0.0679844, 0.88380396, 0.0482117, ]
2           [0.3660313, 0.5833739, 0.05059479]
3          [0.30297568, 0.4430802, 0.25394407]
4        [0.78610265, 0.18985648, 0.0240408, ]
                         ...                  
68317     [0.75754243, 0.21522887, 0.02722873]
68318    [0.05359944, 0.20743202, 0.7389685, ]
68319    [0.01809264, 0.04827376, 0.9336336, ]
68320     [0.68358207, 0.23378702, 0.08263103]
68321       [0.7000523, 0.2591748, 0.04077292]
Name: scores, Length: 68322, dtype: object


In [12]:
# viewing the scores as a dictionary

df.loc[0:20, "scores"].to_dict()

{0: ['0.63110495', '0.18820627', '0.18068886'],
 1: ['0.0679844', '0.88380396', '0.0482117', ''],
 2: ['0.3660313', '0.5833739', '0.05059479'],
 3: ['0.30297568', '0.4430802', '0.25394407'],
 4: ['0.78610265', '0.18985648', '0.0240408', ''],
 5: ['0.02260253', '0.07604674', '0.9013506', ''],
 6: ['0.6982864', '0.27545568', '0.02625783'],
 7: ['0.56902206', '0.39868858', '0.03228934'],
 8: ['0.54107517', '0.37931463', '0.07961021'],
 9: ['0.7315472', '0.24781975', '0.02063307'],
 10: ['0.01434489', '0.9242149', '0.06144014'],
 11: ['0.6039582', '0.37551644', '0.02052534'],
 12: ['0.8756516', '0.11219172', '0.01215654'],
 13: ['0.13284624', '0.8060908', '0.06106308'],
 14: ['0.4532086', '0.50662094', '0.0401704', ''],
 15: ['0.7565991', '0.22713704', '0.01626382'],
 16: ['0.7629791', '0.1803103', '0.05671053'],
 17: ['0.03224549', '0.91901845', '0.04873606'],
 18: ['0.716869', '0.26023984', '0.02289111'],
 19: ['0.0262996', '0.13661228', '0.8370881', ''],
 20: ['0.02549446', '0.897955', 

When viewing the scores as a dictionary, some lists have an extra empty item at the end. This happens wherever a space precedes the closing bracket, because splitting at each space then leaves an empty string after the last score. To solve this, I will create 4 new columns: 1 for each sentiment and 1 for the extra item (i.e., Negative, Neutral, Positive, empty). The empty column will then be checked to verify that it is indeed empty and, if so, dropped.

In [13]:
# separating the list of scores into individual columns and saving it as a new dataframe, df2

df2 = pd.DataFrame(df['scores'].to_list(), columns=['Negative', 'Neutral', 'Positive', 'empty'])

In [14]:
# merging df2 to df

df = pd.concat([df, df2], axis=1)

In [15]:
# viewing the dataframe

df

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,lang,created_at,author_id,in_reply_to_user_id,text,geo.place_id,entities.hashtags,public_metrics.retweet_count,...,year-month,Language,date,inflow,new text,scores,Negative,Neutral,Positive,empty
0,0,0,de,2016-04-20 23:04:40+00:00,14526045,41482148,"@FrauWeh Film gesehen und nur gestaunt. Wir, a...",e11a8b8e3771f9fa,"[{'start': 126, 'end': 131, 'tag': 'OMFG'}]",0,...,2016-04,German,,Syrians,"@user Film gesehen und nur gestaunt. Wir, aus ...","[0.63110495, 0.18820627, 0.18068886]",0.63110495,0.18820627,0.18068886,
1,1,1,de,2016-04-20 22:55:08+00:00,4122038069,,"Syrisch-orthodoxer Bischof: ""Was im Nahen Oste...",06d9a7c249c59bcd,,0,...,2016-04,German,,Syrians,"Syrisch-orthodoxer Bischof: ""Was im Nahen Oste...","[0.0679844, 0.88380396, 0.0482117, ]",0.0679844,0.88380396,0.0482117,
2,2,2,de,2016-04-20 21:27:37+00:00,1179543852,,"Habe schon lang nicht gehört, daß Flüchtling G...",e99b714fe65be4fb,,0,...,2016-04,German,,Syrians,"Habe schon lang nicht gehört, daß Flüchtling G...","[0.3660313, 0.5833739, 0.05059479]",0.3660313,0.5833739,0.05059479,
3,3,3,de,2016-04-20 21:18:58+00:00,224607633,,"""Es kommen kaum noch Flüchtlinge nach Griechen...",3078869807f9dd36,,0,...,2016-04,German,,Syrians,"""Es kommen kaum noch Flüchtlinge nach Griechen...","[0.30297568, 0.4430802, 0.25394407]",0.30297568,0.4430802,0.25394407,
4,4,4,de,2016-04-20 20:56:48+00:00,3022904603,,"Verständlich, aber #Frankreich muss eigene Feh...",48504653e183c91c,"[{'start': 19, 'end': 30, 'tag': 'Frankreich'}...",0,...,2016-04,German,,Syrians,"Verständlich, aber #Frankreich muss eigene Feh...","[0.78610265, 0.18985648, 0.0240408, ]",0.78610265,0.18985648,0.0240408,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
68317,68316,68316,en,2021-06-25 10:34:05+00:00,232958476,,"The ministry of immigration , runs the biggest...",37439688c6302728,,0.0,...,2021-06,English,2021-06-25 00:00:00+00:00,Ukrainians,"The ministry of immigration , runs the biggest...","[0.75754243, 0.21522887, 0.02722873]",0.75754243,0.21522887,0.02722873,
68318,68317,68317,en,2021-06-24 19:29:39+00:00,9474872,9474872.0,"@sudo_f @typo3 @felicity_brand Intellectually,...",8abc99434d4f5d28,,0.0,...,2021-06,English,2021-06-24 00:00:00+00:00,Ukrainians,"@user @user @user Intellectually, it would be ...","[0.05359944, 0.20743202, 0.7389685, ]",0.05359944,0.20743202,0.7389685,
68319,68318,68318,en,2021-06-24 18:33:38+00:00,980714168,2199678761.0,@Waringphilip Agree. Immigration has done me p...,c82d9e53ae03d753,,0.0,...,2021-06,English,2021-06-24 00:00:00+00:00,Ukrainians,"@user Agree. Immigration has done me proud, too.","[0.01809264, 0.04827376, 0.9336336, ]",0.01809264,0.04827376,0.9336336,
68320,68319,68319,en,2021-06-24 11:16:36+00:00,185889479,10809412.0,@rakyll I would love to have automatic cross z...,5bcd72da50f0ee77,,0.0,...,2021-06,English,2021-06-24 00:00:00+00:00,Ukrainians,@user I would love to have automatic cross zon...,"[0.68358207, 0.23378702, 0.08263103]",0.68358207,0.23378702,0.08263103,


In [22]:
# checking the unique values of the empty column

print(df['empty'].unique())

[None '']


The only values inside the empty column are None or an empty string (i.e., ''). Therefore, there are no scores in this column and it can be dropped.
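
The same check can also be expressed as a single True/False test, which is easier to read than a list of unique values. This is a small sketch on made-up data; `empty` and `only_blank` are hypothetical names standing in for the real column.

```python
import pandas as pd

# Stand-in for df['empty']: a mix of None and empty strings.
empty = pd.Series([None, "", None, ""], dtype=object)

# Treat None as an empty string, then test that nothing else remains.
only_blank = empty.fillna("").eq("").all()
print(only_blank)
```

If any row held a real score in this column, the result would be False and the column should not be dropped.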

In [23]:
df = df.drop(columns=['empty'])
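
A possible way to avoid the extra column altogether is to split on any run of whitespace rather than on a single space character. The sketch below is an alternative under that assumption, not the method used in this notebook; `demo` and `parts` are illustrative names and the two score strings are copied from the output above.

```python
import pandas as pd

# Plain Python: splitting on ' ' keeps an empty item after a trailing space,
# while split() with no argument splits on any whitespace and discards it.
print("0.1 0.2 0.7 ".split(" "))
print("0.1 0.2 0.7 ".split())

demo = pd.DataFrame({"scores": ["[0.63110495 0.18820627 0.18068886]",
                                "[0.0679844  0.88380396 0.0482117 ]"]})

# Strip the brackets, split on whitespace, and expand into separate columns.
parts = demo["scores"].str.strip("[]").str.split(expand=True)
parts.columns = ["Negative", "Neutral", "Positive"]
parts = parts.astype(float)
print(parts)
```

With this approach there is no need to collapse repeated spaces first, since whitespace runs of any length act as one separator.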

In [24]:
# viewing the dataframe

df

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,lang,created_at,author_id,in_reply_to_user_id,text,geo.place_id,entities.hashtags,public_metrics.retweet_count,...,year-week,year-month,Language,date,inflow,new text,scores,Negative,Neutral,Positive
0,0,0,de,2016-04-20 23:04:40+00:00,14526045,41482148,"@FrauWeh Film gesehen und nur gestaunt. Wir, a...",e11a8b8e3771f9fa,"[{'start': 126, 'end': 131, 'tag': 'OMFG'}]",0,...,2016-16,2016-04,German,,Syrians,"@user Film gesehen und nur gestaunt. Wir, aus ...","[0.63110495, 0.18820627, 0.18068886]",0.63110495,0.18820627,0.18068886
1,1,1,de,2016-04-20 22:55:08+00:00,4122038069,,"Syrisch-orthodoxer Bischof: ""Was im Nahen Oste...",06d9a7c249c59bcd,,0,...,2016-16,2016-04,German,,Syrians,"Syrisch-orthodoxer Bischof: ""Was im Nahen Oste...","[0.0679844, 0.88380396, 0.0482117, ]",0.0679844,0.88380396,0.0482117
2,2,2,de,2016-04-20 21:27:37+00:00,1179543852,,"Habe schon lang nicht gehört, daß Flüchtling G...",e99b714fe65be4fb,,0,...,2016-16,2016-04,German,,Syrians,"Habe schon lang nicht gehört, daß Flüchtling G...","[0.3660313, 0.5833739, 0.05059479]",0.3660313,0.5833739,0.05059479
3,3,3,de,2016-04-20 21:18:58+00:00,224607633,,"""Es kommen kaum noch Flüchtlinge nach Griechen...",3078869807f9dd36,,0,...,2016-16,2016-04,German,,Syrians,"""Es kommen kaum noch Flüchtlinge nach Griechen...","[0.30297568, 0.4430802, 0.25394407]",0.30297568,0.4430802,0.25394407
4,4,4,de,2016-04-20 20:56:48+00:00,3022904603,,"Verständlich, aber #Frankreich muss eigene Feh...",48504653e183c91c,"[{'start': 19, 'end': 30, 'tag': 'Frankreich'}...",0,...,2016-16,2016-04,German,,Syrians,"Verständlich, aber #Frankreich muss eigene Feh...","[0.78610265, 0.18985648, 0.0240408, ]",0.78610265,0.18985648,0.0240408
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
68317,68316,68316,en,2021-06-25 10:34:05+00:00,232958476,,"The ministry of immigration , runs the biggest...",37439688c6302728,,0.0,...,2021-25,2021-06,English,2021-06-25 00:00:00+00:00,Ukrainians,"The ministry of immigration , runs the biggest...","[0.75754243, 0.21522887, 0.02722873]",0.75754243,0.21522887,0.02722873
68318,68317,68317,en,2021-06-24 19:29:39+00:00,9474872,9474872.0,"@sudo_f @typo3 @felicity_brand Intellectually,...",8abc99434d4f5d28,,0.0,...,2021-25,2021-06,English,2021-06-24 00:00:00+00:00,Ukrainians,"@user @user @user Intellectually, it would be ...","[0.05359944, 0.20743202, 0.7389685, ]",0.05359944,0.20743202,0.7389685
68319,68318,68318,en,2021-06-24 18:33:38+00:00,980714168,2199678761.0,@Waringphilip Agree. Immigration has done me p...,c82d9e53ae03d753,,0.0,...,2021-25,2021-06,English,2021-06-24 00:00:00+00:00,Ukrainians,"@user Agree. Immigration has done me proud, too.","[0.01809264, 0.04827376, 0.9336336, ]",0.01809264,0.04827376,0.9336336
68320,68319,68319,en,2021-06-24 11:16:36+00:00,185889479,10809412.0,@rakyll I would love to have automatic cross z...,5bcd72da50f0ee77,,0.0,...,2021-25,2021-06,English,2021-06-24 00:00:00+00:00,Ukrainians,@user I would love to have automatic cross zon...,"[0.68358207, 0.23378702, 0.08263103]",0.68358207,0.23378702,0.08263103


## 5. Creating CLASS column

### 5.1. Converting sentiment columns to float

#### 5.1.1. Solving error
Before creating the CLASS column, the sentiment columns (Negative, Neutral, Positive) need to be converted from string to float. However, initial attempts to do so revealed that the Negative column has a couple of strange values. Specifically, two entries contain only the letter 'a'. The code below solves this issue and then continues converting the sentiment columns to float.

In [27]:
# locating which entries have the value 'a' for the Negative column

df.loc[df['Negative'] == 'a', 'Negative']

6130    a
6131    a
Name: Negative, dtype: object

Indices 6130 and 6131 have been identified. They will be viewed and dropped.

In [28]:
# viewing index 6130

df.iloc[6130].to_dict()

{'Unnamed: 0.1': '6130',
 'Unnamed: 0': '6130',
 'lang': 'de',
 'created_at': '2016-01-08 22:52:29+00:00',
 'author_id': '3239113792',
 'in_reply_to_user_id': '176171694',
 'text': '❗❗\r#Letzte #Chance! Noch schnell @freifunkmainz beim Online-Voting für #freies #WLAN für #Flüchtlinge unterstützen https://t.co/OhCeAL77jH"',
 'geo.place_id': '8abc99434d4f5d28',
 'entities.hashtags': "[{'start': 3, 'end': 10, 'tag': 'Letzte'}, {'start': 11, 'end': 18, 'tag': 'Chance'}, {'start': 71, 'end': 78, 'tag': 'freies'}, {'start': 79, 'end': 84, 'tag': 'WLAN'}, {'start': 89, 'end': 101, 'tag': 'Flüchtlinge'}]",
 'public_metrics.retweet_count': '0',
 'public_metrics.reply_count': 1.0,
 'public_metrics.like_count': 0.0,
 'public_metrics.quote_count': 0.0,
 'public_metrics.impression_count': 0.0,
 'geo.coordinates.coordinates': '[7.00845283, 50.97120189]',
 'new_created_at': '2016-01-08 22:52:29',
 'week': 1.0,
 'month': 1.0,
 'year': 2016.0,
 'year-week': '2016-01',
 'year-month': '2016-01',
 'Langua

In [29]:
# viewing index 6131

df.iloc[6131].to_dict()

{'Unnamed: 0.1': '#Letzte #Chance! Noch schnell @user beim Online-Voting für #freies #WLAN für #Flüchtlinge unterstützen http',
 'Unnamed: 0': '[0.03929918 0.3001709  0.66053   ]',
 'lang': nan,
 'created_at': nan,
 'author_id': nan,
 'in_reply_to_user_id': nan,
 'text': nan,
 'geo.place_id': nan,
 'entities.hashtags': nan,
 'public_metrics.retweet_count': nan,
 'public_metrics.reply_count': nan,
 'public_metrics.like_count': nan,
 'public_metrics.quote_count': nan,
 'public_metrics.impression_count': nan,
 'geo.coordinates.coordinates': nan,
 'new_created_at': nan,
 'week': nan,
 'month': nan,
 'year': nan,
 'year-week': nan,
 'year-month': nan,
 'Language': nan,
 'date': nan,
 'inflow': nan,
 'new text': nan,
 'scores': ['a'],
 'Negative': 'a',
 'Neutral': None,
 'Positive': None}

Both entries reveal nearly identical tweets that do not have any sentiment scores. These need to be dropped.

In [30]:
# dropping index 6130

df = df.drop(df.index[6130])

In [31]:
# checking which entries still have the value 'a' for the Negative column

df.loc[df['Negative'] == 'a', 'Negative']

6131    a
Name: Negative, dtype: object

Index 6130 has been deleted and index 6131 remains.

**NOTE**: Although the remaining row still carries the label 6131, its position has shifted up by one because the row at 6130 was deleted. Since `df.index[...]` looks up a row by position rather than by label, the row labelled 6131 is now found at position 6130. Therefore, the subsequent code uses the number 6130 to refer to the row labelled 6131.

In [32]:
# dropping index 6131

df = df.drop(df.index[6130])
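
The position shift can be sidestepped by dropping rows by label, which does not change when other rows are removed. The sketch below illustrates this on a four-row toy frame; `toy`, `by_label`, and `by_position` are hypothetical names and the values are made up.

```python
import pandas as pd

toy = pd.DataFrame({"Negative": ["0.5", "a", "a", "0.2"]},
                   index=[6129, 6130, 6131, 6132])

# Drop both rows in one step by their index labels.
by_label = toy.drop(index=[6130, 6131])

# Equivalent drop by position: the two rows sit at positions 1 and 2 here.
by_position = toy.drop(toy.index[[1, 2]])

print(by_label)
```

Both calls give the same result, but the label-based version does not require keeping track of how positions move after each deletion.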

In [33]:
# checking which entries still have the value 'a' for the Negative column

df.loc[df['Negative'] == 'a', 'Negative']

Series([], Name: Negative, dtype: object)

All entries with the value 'a' for the Negative column have been dropped. Conversion of the sentiment columns from string to float can continue now.
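
Searching for the literal value 'a' works here because the offending value was already known from the failed conversion. A more general check, sketched below on made-up data, flags any entry that cannot be read as a number; `negative`, `as_number`, and `bad_rows` are hypothetical names.

```python
import pandas as pd

# Stand-in for df['Negative'] while it is still stored as strings.
negative = pd.Series(["0.63110495", "a", "0.0679844"], dtype=object)

# errors="coerce" turns anything non-numeric into NaN instead of raising.
as_number = pd.to_numeric(negative, errors="coerce")

# Rows that had a value but could not be converted.
bad_rows = negative[as_number.isna() & negative.notna()]
print(bad_rows)
```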

#### 5.1.2. Converting Negative to float

In [34]:
df['Negative'] = df['Negative'].astype(float)

#### 5.1.3. Converting Neutral to float

In [35]:
df['Neutral'] = df['Neutral'].astype(float)

#### 5.1.4. Converting Positive to float

In [36]:
df['Positive'] = df['Positive'].astype(float)
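
The three conversions can also be written as one statement by selecting the columns together. This is a brief sketch on a one-row toy frame; `toy` and `sentiment_cols` are illustrative names.

```python
import pandas as pd

toy = pd.DataFrame({"Negative": ["0.63110495"],
                    "Neutral": ["0.18820627"],
                    "Positive": ["0.18068886"]})

sentiment_cols = ["Negative", "Neutral", "Positive"]

# Convert all three string columns to float in a single assignment.
toy[sentiment_cols] = toy[sentiment_cols].astype(float)
print(toy.dtypes)
```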

### 5.2. Deriving CLASS column from largest sentiment column

In [37]:
# creating a new empty column, CLASS

df['CLASS'] = ''

In [38]:
# defining a function to derive CLASS based on which sentiment column is largest

def calc_CLASS(Negative, Neutral, Positive):
    # fallback so the function still returns a value if no score is strictly largest (e.g., an exact tie)
    CLASS = None
    if Negative > Neutral and Negative > Positive:
        CLASS = 'NEGATIVE'
    elif Neutral > Negative and Neutral > Positive:
        CLASS = 'NEUTRAL'
    elif Positive > Negative and Positive > Neutral:
        CLASS = 'POSITIVE'
    return CLASS

In [39]:
# applying function

df['CLASS'] = df.apply(lambda x: calc_CLASS(x['Negative'], x['Neutral'], x['Positive']), 
                        axis=1)
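
For comparison, the same classification can be derived without a row-by-row function by asking pandas which column holds the largest value in each row. The sketch below is an alternative, not the approach used in this notebook; `toy` is a hypothetical frame filled with three score triples from the output above.

```python
import pandas as pd

toy = pd.DataFrame({
    "Negative": [0.63110495, 0.0679844, 0.05359944],
    "Neutral": [0.18820627, 0.88380396, 0.20743202],
    "Positive": [0.18068886, 0.0482117, 0.7389685],
})

# idxmax(axis=1) returns, per row, the name of the column with the largest value.
toy["CLASS"] = toy[["Negative", "Neutral", "Positive"]].idxmax(axis=1).str.upper()
print(toy)
```

One difference worth keeping in mind is that `idxmax` resolves an exact tie in favour of the first listed column.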

In [40]:
# viewing dataframe

df

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,lang,created_at,author_id,in_reply_to_user_id,text,geo.place_id,entities.hashtags,public_metrics.retweet_count,...,year-month,Language,date,inflow,new text,scores,Negative,Neutral,Positive,CLASS
0,0,0,de,2016-04-20 23:04:40+00:00,14526045,41482148,"@FrauWeh Film gesehen und nur gestaunt. Wir, a...",e11a8b8e3771f9fa,"[{'start': 126, 'end': 131, 'tag': 'OMFG'}]",0,...,2016-04,German,,Syrians,"@user Film gesehen und nur gestaunt. Wir, aus ...","[0.63110495, 0.18820627, 0.18068886]",0.631105,0.188206,0.180689,NEGATIVE
1,1,1,de,2016-04-20 22:55:08+00:00,4122038069,,"Syrisch-orthodoxer Bischof: ""Was im Nahen Oste...",06d9a7c249c59bcd,,0,...,2016-04,German,,Syrians,"Syrisch-orthodoxer Bischof: ""Was im Nahen Oste...","[0.0679844, 0.88380396, 0.0482117, ]",0.067984,0.883804,0.048212,NEUTRAL
2,2,2,de,2016-04-20 21:27:37+00:00,1179543852,,"Habe schon lang nicht gehört, daß Flüchtling G...",e99b714fe65be4fb,,0,...,2016-04,German,,Syrians,"Habe schon lang nicht gehört, daß Flüchtling G...","[0.3660313, 0.5833739, 0.05059479]",0.366031,0.583374,0.050595,NEUTRAL
3,3,3,de,2016-04-20 21:18:58+00:00,224607633,,"""Es kommen kaum noch Flüchtlinge nach Griechen...",3078869807f9dd36,,0,...,2016-04,German,,Syrians,"""Es kommen kaum noch Flüchtlinge nach Griechen...","[0.30297568, 0.4430802, 0.25394407]",0.302976,0.443080,0.253944,NEUTRAL
4,4,4,de,2016-04-20 20:56:48+00:00,3022904603,,"Verständlich, aber #Frankreich muss eigene Feh...",48504653e183c91c,"[{'start': 19, 'end': 30, 'tag': 'Frankreich'}...",0,...,2016-04,German,,Syrians,"Verständlich, aber #Frankreich muss eigene Feh...","[0.78610265, 0.18985648, 0.0240408, ]",0.786103,0.189856,0.024041,NEGATIVE
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
68317,68316,68316,en,2021-06-25 10:34:05+00:00,232958476,,"The ministry of immigration , runs the biggest...",37439688c6302728,,0.0,...,2021-06,English,2021-06-25 00:00:00+00:00,Ukrainians,"The ministry of immigration , runs the biggest...","[0.75754243, 0.21522887, 0.02722873]",0.757542,0.215229,0.027229,NEGATIVE
68318,68317,68317,en,2021-06-24 19:29:39+00:00,9474872,9474872.0,"@sudo_f @typo3 @felicity_brand Intellectually,...",8abc99434d4f5d28,,0.0,...,2021-06,English,2021-06-24 00:00:00+00:00,Ukrainians,"@user @user @user Intellectually, it would be ...","[0.05359944, 0.20743202, 0.7389685, ]",0.053599,0.207432,0.738969,POSITIVE
68319,68318,68318,en,2021-06-24 18:33:38+00:00,980714168,2199678761.0,@Waringphilip Agree. Immigration has done me p...,c82d9e53ae03d753,,0.0,...,2021-06,English,2021-06-24 00:00:00+00:00,Ukrainians,"@user Agree. Immigration has done me proud, too.","[0.01809264, 0.04827376, 0.9336336, ]",0.018093,0.048274,0.933634,POSITIVE
68320,68319,68319,en,2021-06-24 11:16:36+00:00,185889479,10809412.0,@rakyll I would love to have automatic cross z...,5bcd72da50f0ee77,,0.0,...,2021-06,English,2021-06-24 00:00:00+00:00,Ukrainians,@user I would love to have automatic cross zon...,"[0.68358207, 0.23378702, 0.08263103]",0.683582,0.233787,0.082631,NEGATIVE


## 6. Creating num_CLASS column

In addition to a categorical class column, CLASS, a numerical class column, num_CLASS, will be created. The value of num_CLASS will be:
* -1 if the tweet is most likely negative
* 0 if the tweet is most likely neutral
* 1 if the tweet is most likely positive

In [42]:
# creating a new empty column, num_CLASS

df['num_CLASS'] = ''

In [43]:
# defining a function to derive num_CLASS based on which sentiment column is largest

def calc_num_CLASS(Negative, Neutral, Positive):
    # fallback so the function still returns a value if no score is strictly largest (e.g., an exact tie)
    num_CLASS = None
    if Negative > Neutral and Negative > Positive:
        num_CLASS = -1
    elif Neutral > Negative and Neutral > Positive:
        num_CLASS = 0
    elif Positive > Negative and Positive > Neutral:
        num_CLASS = 1
    return num_CLASS

In [44]:
# applying function

df['num_CLASS'] = df.apply(lambda x: calc_num_CLASS(x['Negative'], x['Neutral'], x['Positive']),
                        axis=1)
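
Since CLASS already exists at this point, the numerical column could also be derived from it with a lookup table instead of comparing the scores again. This is a minimal sketch on made-up rows; `class_to_num` and `toy` are hypothetical names.

```python
import pandas as pd

# Lookup table from the categorical label to its numerical code.
class_to_num = {"NEGATIVE": -1, "NEUTRAL": 0, "POSITIVE": 1}

toy = pd.DataFrame({"CLASS": ["NEGATIVE", "NEUTRAL", "POSITIVE", "NEGATIVE"]})

# map() replaces each label with the matching value from the dictionary.
toy["num_CLASS"] = toy["CLASS"].map(class_to_num)
print(toy)
```

Deriving one column from the other guarantees that the two can never disagree.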

In [45]:
# viewing the dataframe

df

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,lang,created_at,author_id,in_reply_to_user_id,text,geo.place_id,entities.hashtags,public_metrics.retweet_count,...,Language,date,inflow,new text,scores,Negative,Neutral,Positive,CLASS,num_CLASS
0,0,0,de,2016-04-20 23:04:40+00:00,14526045,41482148,"@FrauWeh Film gesehen und nur gestaunt. Wir, a...",e11a8b8e3771f9fa,"[{'start': 126, 'end': 131, 'tag': 'OMFG'}]",0,...,German,,Syrians,"@user Film gesehen und nur gestaunt. Wir, aus ...","[0.63110495, 0.18820627, 0.18068886]",0.631105,0.188206,0.180689,NEGATIVE,-1
1,1,1,de,2016-04-20 22:55:08+00:00,4122038069,,"Syrisch-orthodoxer Bischof: ""Was im Nahen Oste...",06d9a7c249c59bcd,,0,...,German,,Syrians,"Syrisch-orthodoxer Bischof: ""Was im Nahen Oste...","[0.0679844, 0.88380396, 0.0482117, ]",0.067984,0.883804,0.048212,NEUTRAL,0
2,2,2,de,2016-04-20 21:27:37+00:00,1179543852,,"Habe schon lang nicht gehört, daß Flüchtling G...",e99b714fe65be4fb,,0,...,German,,Syrians,"Habe schon lang nicht gehört, daß Flüchtling G...","[0.3660313, 0.5833739, 0.05059479]",0.366031,0.583374,0.050595,NEUTRAL,0
3,3,3,de,2016-04-20 21:18:58+00:00,224607633,,"""Es kommen kaum noch Flüchtlinge nach Griechen...",3078869807f9dd36,,0,...,German,,Syrians,"""Es kommen kaum noch Flüchtlinge nach Griechen...","[0.30297568, 0.4430802, 0.25394407]",0.302976,0.443080,0.253944,NEUTRAL,0
4,4,4,de,2016-04-20 20:56:48+00:00,3022904603,,"Verständlich, aber #Frankreich muss eigene Feh...",48504653e183c91c,"[{'start': 19, 'end': 30, 'tag': 'Frankreich'}...",0,...,German,,Syrians,"Verständlich, aber #Frankreich muss eigene Feh...","[0.78610265, 0.18985648, 0.0240408, ]",0.786103,0.189856,0.024041,NEGATIVE,-1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
68317,68316,68316,en,2021-06-25 10:34:05+00:00,232958476,,"The ministry of immigration , runs the biggest...",37439688c6302728,,0.0,...,English,2021-06-25 00:00:00+00:00,Ukrainians,"The ministry of immigration , runs the biggest...","[0.75754243, 0.21522887, 0.02722873]",0.757542,0.215229,0.027229,NEGATIVE,-1
68318,68317,68317,en,2021-06-24 19:29:39+00:00,9474872,9474872.0,"@sudo_f @typo3 @felicity_brand Intellectually,...",8abc99434d4f5d28,,0.0,...,English,2021-06-24 00:00:00+00:00,Ukrainians,"@user @user @user Intellectually, it would be ...","[0.05359944, 0.20743202, 0.7389685, ]",0.053599,0.207432,0.738969,POSITIVE,1
68319,68318,68318,en,2021-06-24 18:33:38+00:00,980714168,2199678761.0,@Waringphilip Agree. Immigration has done me p...,c82d9e53ae03d753,,0.0,...,English,2021-06-24 00:00:00+00:00,Ukrainians,"@user Agree. Immigration has done me proud, too.","[0.01809264, 0.04827376, 0.9336336, ]",0.018093,0.048274,0.933634,POSITIVE,1
68320,68319,68319,en,2021-06-24 11:16:36+00:00,185889479,10809412.0,@rakyll I would love to have automatic cross z...,5bcd72da50f0ee77,,0.0,...,English,2021-06-24 00:00:00+00:00,Ukrainians,@user I would love to have automatic cross zon...,"[0.68358207, 0.23378702, 0.08263103]",0.683582,0.233787,0.082631,NEGATIVE,-1


## 7. Dropping unnecessary columns

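There appear to be some leftover columns called 'Unnamed: 0.1' and 'Unnamed: 0'. pandas typically creates such columns when a csv file that was saved together with its index is read back in. They carry no information and need to be dropped.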

In [47]:
# viewing the column names

df.columns[0:2]

Index(['Unnamed: 0.1', 'Unnamed: 0'], dtype='object')

In [48]:
# dropping the columns

df.drop(['Unnamed: 0.1', 'Unnamed: 0'], axis=1, inplace=True)
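
When the exact names of these columns vary between runs, they can be selected by their shared prefix instead of being listed by hand. The sketch below assumes that every column starting with "Unnamed" is unwanted; `toy` and `unnamed` are illustrative names.

```python
import pandas as pd

toy = pd.DataFrame({"Unnamed: 0.1": [0], "Unnamed: 0": [0],
                    "lang": ["de"], "text": ["..."]})

# Collect every column whose name starts with 'Unnamed'.
unnamed = [c for c in toy.columns if c.startswith("Unnamed")]

toy = toy.drop(columns=unnamed)
print(toy.columns.tolist())
```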

In [49]:
# viewing the dataframe

df

Unnamed: 0,lang,created_at,author_id,in_reply_to_user_id,text,geo.place_id,entities.hashtags,public_metrics.retweet_count,public_metrics.reply_count,public_metrics.like_count,...,Language,date,inflow,new text,scores,Negative,Neutral,Positive,CLASS,num_CLASS
0,de,2016-04-20 23:04:40+00:00,14526045,41482148,"@FrauWeh Film gesehen und nur gestaunt. Wir, a...",e11a8b8e3771f9fa,"[{'start': 126, 'end': 131, 'tag': 'OMFG'}]",0,1.0,0.0,...,German,,Syrians,"@user Film gesehen und nur gestaunt. Wir, aus ...","[0.63110495, 0.18820627, 0.18068886]",0.631105,0.188206,0.180689,NEGATIVE,-1
1,de,2016-04-20 22:55:08+00:00,4122038069,,"Syrisch-orthodoxer Bischof: ""Was im Nahen Oste...",06d9a7c249c59bcd,,0,0.0,0.0,...,German,,Syrians,"Syrisch-orthodoxer Bischof: ""Was im Nahen Oste...","[0.0679844, 0.88380396, 0.0482117, ]",0.067984,0.883804,0.048212,NEUTRAL,0
2,de,2016-04-20 21:27:37+00:00,1179543852,,"Habe schon lang nicht gehört, daß Flüchtling G...",e99b714fe65be4fb,,0,0.0,0.0,...,German,,Syrians,"Habe schon lang nicht gehört, daß Flüchtling G...","[0.3660313, 0.5833739, 0.05059479]",0.366031,0.583374,0.050595,NEUTRAL,0
3,de,2016-04-20 21:18:58+00:00,224607633,,"""Es kommen kaum noch Flüchtlinge nach Griechen...",3078869807f9dd36,,0,0.0,0.0,...,German,,Syrians,"""Es kommen kaum noch Flüchtlinge nach Griechen...","[0.30297568, 0.4430802, 0.25394407]",0.302976,0.443080,0.253944,NEUTRAL,0
4,de,2016-04-20 20:56:48+00:00,3022904603,,"Verständlich, aber #Frankreich muss eigene Feh...",48504653e183c91c,"[{'start': 19, 'end': 30, 'tag': 'Frankreich'}...",0,0.0,0.0,...,German,,Syrians,"Verständlich, aber #Frankreich muss eigene Feh...","[0.78610265, 0.18985648, 0.0240408, ]",0.786103,0.189856,0.024041,NEGATIVE,-1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
68317,en,2021-06-25 10:34:05+00:00,232958476,,"The ministry of immigration , runs the biggest...",37439688c6302728,,0.0,0.0,0.0,...,English,2021-06-25 00:00:00+00:00,Ukrainians,"The ministry of immigration , runs the biggest...","[0.75754243, 0.21522887, 0.02722873]",0.757542,0.215229,0.027229,NEGATIVE,-1
68318,en,2021-06-24 19:29:39+00:00,9474872,9474872.0,"@sudo_f @typo3 @felicity_brand Intellectually,...",8abc99434d4f5d28,,0.0,1.0,3.0,...,English,2021-06-24 00:00:00+00:00,Ukrainians,"@user @user @user Intellectually, it would be ...","[0.05359944, 0.20743202, 0.7389685, ]",0.053599,0.207432,0.738969,POSITIVE,1
68319,en,2021-06-24 18:33:38+00:00,980714168,2199678761.0,@Waringphilip Agree. Immigration has done me p...,c82d9e53ae03d753,,0.0,0.0,1.0,...,English,2021-06-24 00:00:00+00:00,Ukrainians,"@user Agree. Immigration has done me proud, too.","[0.01809264, 0.04827376, 0.9336336, ]",0.018093,0.048274,0.933634,POSITIVE,1
68320,en,2021-06-24 11:16:36+00:00,185889479,10809412.0,@rakyll I would love to have automatic cross z...,5bcd72da50f0ee77,,0.0,0.0,0.0,...,English,2021-06-24 00:00:00+00:00,Ukrainians,@user I would love to have automatic cross zon...,"[0.68358207, 0.23378702, 0.08263103]",0.683582,0.233787,0.082631,NEGATIVE,-1


## 8. Saving the data
The data will be saved in the CASS_thesis folder as a csv titled,

> *04_Prepared-data_merged.csv*

In [50]:
df.to_csv(CASS_thesis / '04_Prepared-data_merged.csv')
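
By default, `to_csv` also writes the row index as a first, unnamed column, which reappears as 'Unnamed: 0' when the file is read again. The sketch below shows the difference that `index=False` makes, using an in-memory buffer rather than a real file; `toy`, `with_index`, and `without_index` are hypothetical names. Whether to change the save step depends on what later notebooks expect, so this is offered only as an option.

```python
import io

import pandas as pd

toy = pd.DataFrame({"CLASS": ["NEGATIVE", "POSITIVE"], "num_CLASS": [-1, 1]})

# Round trip with the default settings: the index becomes an extra column.
with_index = pd.read_csv(io.StringIO(toy.to_csv()))

# Round trip with index=False: only the real columns are written.
without_index = pd.read_csv(io.StringIO(toy.to_csv(index=False)))

print(with_index.columns.tolist())
print(without_index.columns.tolist())
```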