# Pre-processing - Syrian and Ukrainian inflows
**Author**: Andrea Cass

## 1. About this notebook

The purpose of this Jupyter notebook is to pre-process the data collected in the 01_Data-collection Notebooks:
> *01a_Data-Collection_Syrian-eng.csv*

> *01b_Data-Collection_Syrian-de.csv*

> *01a_Data-Collection_Ukrainian-eng.csv*

> *01b_Data-Collection_Ukrainian-de.csv*

Goals:
* Format dates
* Create new 'language' column
* Merge English- and German-language datasets into one
* Create new 'inflow' column
* Merge Syrian inflow and Ukrainian inflow datasets into one
* Drop unnecessary columns

The output will be a single dataset saved as a CSV file titled
> *02_Pre-processed_merged.csv*

**NOTE**: Do NOT run all cells. Section **3.2. CASS_thesis** contains two alternative steps depending on whether or not you have already created a folder called CASS_thesis using code from the first Notebook.
> **3.2.1. Creating a new folder, CASS_thesis**

> **3.2.2. Naming CASS_thesis**

Make sure to read the instructions under **3.2. CASS_thesis** to determine which of the two alternatives you should run and which you should skip. The two are mutually exclusive. That is, if you run one, you should not run the other. Code from all other sections should be run as usual.

## 2. Imports

In [1]:
# standard library
import os
import re
from datetime import datetime, timedelta
from datetime import datetime as dt
from pathlib import Path

# third-party packages
import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import seaborn as sns
from bs4 import BeautifulSoup
from matplotlib import dates as mpl_dates
from nltk.corpus import stopwords
from textblob import TextBlob, Word

## 3. Working directory & file paths

Before beginning data pre-processing, the working directory needs to be set up. Additionally, if you did not already use the Notebook titled "01_Data-Collection_Syrian" to create a folder called "CASS_thesis", code is provided here to do so. 

Two objects will be named:
* **cwd**: the current working directory (e.g., your Desktop)
* **CASS_thesis**: the folder where all data from my Notebooks will be saved

### 3.1. Current working directory
Use the code below to find out what your current working directory is set to.

In [None]:
# find current working directory

os.getcwd()

If your current working directory is not your desired directory, change it by:
1. deciding where you would like your working directory to be (e.g., your Desktop)
2. entering the file path of your desired working directory into the code below

**NOTE**: The code below contains the path to **my** desired working directory as an example. You must change it to the path of **your** desired working directory. Keep in mind that my example follows macOS path conventions; Windows paths are formatted differently.

**NOTE**: If you are satisfied with your working directory and do NOT wish to change it, skip the block of code underneath **3.1.1. Changing current working directory** and, instead, proceed from the block of code underneath **3.1.2. Naming current working directory**.

#### 3.1.1. Changing current working directory

In [None]:
# changing current working directory

os.chdir('/Users/andycass/Desktop')
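
Because path formats differ between macOS and Windows, one option is to build the path with `pathlib` instead of typing it as a string. The sketch below is only an illustration: the helper name `change_working_directory` is hypothetical and not part of the original workflow, and it assumes the target folder already exists.

```python
import os
from pathlib import Path


def change_working_directory(target: Path) -> Path:
    """Switch to `target` if it exists and return the new working directory.

    Hypothetical helper for illustration; raises instead of silently failing.
    """
    if not target.is_dir():
        raise FileNotFoundError(f"Directory not found: {target}")
    os.chdir(target)
    return Path.cwd()
```

For example, `change_working_directory(Path.home() / "Desktop")` would point to the Desktop folder on either operating system, provided such a folder exists in your home directory.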

#### 3.1.2. Naming current working directory
Now that your current working directory is established, use the code below to name it "cwd":

In [None]:
# naming the current working directory

cwd = Path.cwd()

In [None]:
# double-checking the current working directory location

cwd

### 3.2. CASS_thesis
You may or may not have already created a folder named "CASS_thesis" depending on whether you ran the code from the first Data Collection Notebook. Please carefully read the instructions below to ensure you run the code suitable for you.

* If you *HAVE* already created the CASS_thesis folder:
    1. *Skip* **3.2.1. Creating a new folder, CASS_thesis**
    2. *Proceed to* **3.2.2. Naming CASS_thesis**

* If you have *NOT* already created the CASS_thesis folder:
    1. *Proceed to* **3.2.1. Creating a new folder, CASS_thesis**
    2. *Skip* **3.2.2. Naming CASS_thesis**

#### 3.2.1. Creating a new folder, CASS_thesis

**NOTE**: If you have already created the CASS_thesis folder, skip this step and continue from **3.2.2. Naming CASS_thesis**.

In [None]:
# naming the CASS_thesis folder

CASS_thesis = cwd / 'CASS_thesis'

In [None]:
# creating the CASS_thesis folder

CASS_thesis.mkdir()

In [None]:
# double-checking the CASS_thesis location

CASS_thesis

#### 3.2.2. Naming CASS_thesis

**NOTE**: If you just created the CASS_thesis folder using the code above, skip this step and continue from **4. Syrian inflow datasets**.

In [None]:
# naming the CASS_thesis folder

CASS_thesis = cwd / 'CASS_thesis'
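
The two alternatives in this section exist because `mkdir()` raises an error when the folder is already there. As a possible simplification, a single cell could handle both cases with `exist_ok=True`. This is a minimal sketch, not the approach used in the original Notebooks; the function name `ensure_thesis_folder` is hypothetical.

```python
from pathlib import Path


def ensure_thesis_folder(base: Path, name: str = "CASS_thesis") -> Path:
    """Return base/name, creating the folder only if it does not exist yet.

    Hypothetical helper; `exist_ok=True` makes repeated runs harmless.
    """
    folder = base / name
    folder.mkdir(exist_ok=True)
    return folder
```

With this helper, `CASS_thesis = ensure_thesis_folder(cwd)` could be run regardless of whether the folder was created in the first Notebook.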

## 4. Syrian inflow datasets
### 4.1. English-language dataset
#### 4.1.1 Loading the data

In [2]:
df_eng = pd.read_csv(CASS_thesis / "01a_Data-Collection_Syrian-eng.csv")

#### 4.1.2 Viewing the dataframe

In [3]:
df_eng

Unnamed: 0,id,conversation_id,created_at,lang,author_id,edit_history_tweet_ids,possibly_sensitive,text,reply_settings,entities.mentions,...,in_reply_to_user_id,attachments.media_keys,entities.hashtags,geo.coordinates.type,geo.coordinates.coordinates,context_annotations,attachments.poll_ids,withheld.copyright,withheld.country_codes,withheld.scope
0,722876040876580864,722876040876580864,2016-04-20T19:54:13.000Z,en,339833759,['722876040876580864'],False,UNHCR - Survivors report massive loss of life ...,everyone,"[{'start': 110, 'end': 119, 'username': 'Refug...",...,,,,,,,,,,
1,722870567636828160,722717417617649665,2016-04-20T19:32:28.000Z,en,635283767,['722870567636828160'],False,@nebbia451 you think those Taiwanese hiking cl...,everyone,"[{'start': 0, 'end': 10, 'username': 'nebbia45...",...,500741275.0,,,,,,,,,
2,722802163114733572,722802163114733572,2016-04-20T15:00:39.000Z,en,412624794,['722802163114733572'],False,Syrian artists are painting bright murals in t...,everyone,"[{'start': 114, 'end': 123, 'username': 'masha...",...,,,,,,,,,,
3,722752663130152960,722752663130152960,2016-04-20T11:43:57.000Z,en,146389787,['722752663130152960'],False,Stunned that 60% of care givers in Germany hav...,everyone,,...,,,,,,,,,,
4,722712144354144256,722712144354144256,2016-04-20T09:02:57.000Z,en,2863752725,['722712144354144256'],False,Paul Guest is an excellent moderator at the co...,everyone,,...,,['3_722697412549193728'],"[{'start': 77, 'end': 84, 'tag': 'tcaepi'}]",,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7371,546823194590449664,546823194590449664,2014-12-22T00:23:03.000Z,en,186899860,['546823194590449664'],False,"DE-News : Berlin, a Russian immigrant, handpic...",everyone,,...,,,,Point,"[13.46757813, 52.5913198]",,,,,
7372,546823190874308608,546823190874308608,2014-12-22T00:23:02.000Z,en,186899860,['546823190874308608'],False,"DE-News : Sen. Marco Rubio, R-Fla., the son of...",everyone,,...,,,,Point,"[13.46757813, 52.5913198]",,,,,
7373,546699388316184577,546699388316184577,2014-12-21T16:11:06.000Z,en,186899860,['546699388316184577'],False,DE-News : There is little to break the monoton...,everyone,,...,,,,Point,"[13.46757813, 52.5913198]",,,,,
7374,546544368169918464,546544368169918464,2014-12-21T05:55:06.000Z,en,186899860,['546544368169918464'],False,DE-News : View of a vacant lot earmarked for r...,everyone,,...,,,,Point,"[13.46757813, 52.5913198]",,,,,


#### 4.1.3 Formatting dates 
The created_at column, containing information about when the tweet was posted, will be converted to datetime format and normalized so that new columns (e.g., 'year-month') can be derived from it.

In [4]:
# converting created_at to datetime format

df_eng["created_at"] = pd.to_datetime(df_eng["created_at"])

In [5]:
# converting it to date and creating a new column called "date"

df_eng['date'] = df_eng['created_at'].dt.normalize()

In [6]:
# creating week, month, year, year-week, and year-month columns
# note: .dt.week is deprecated and was removed in pandas 2.0; .dt.isocalendar().week gives the same ISO week number
# note: '%U' counts weeks starting on Sunday, so 'year-week' can differ from the ISO 'week' column

df_eng['week'] = df_eng['created_at'].dt.isocalendar().week
df_eng['month'] = df_eng['created_at'].dt.month
df_eng['year'] = df_eng['created_at'].dt.year
df_eng['year-week'] = df_eng['created_at'].dt.strftime('%Y-%U')
df_eng['year-month'] = df_eng['created_at'].dt.strftime('%Y-%m')
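
The 'week' column holds the ISO week number, while the 'year-week' label is built with `%U`, which counts weeks starting on Sunday. The two can therefore disagree for the same tweet. The sketch below illustrates this on two timestamps taken from the dataset; the `iso_label` column is an assumption added for comparison and is not created anywhere in this Notebook.

```python
import pandas as pd

# two created_at values copied from the English-language Syrian dataset
sample = pd.Series(
    pd.to_datetime(["2016-04-20T19:54:13.000Z", "2014-12-22T00:23:03.000Z"])
)

# ISO calendar components (year, week, day) for each timestamp
iso = sample.dt.isocalendar()
iso_week = iso["week"]

# label as built in this Notebook: calendar year + Sunday-based week number
sunday_label = sample.dt.strftime("%Y-%U")

# hypothetical alternative: ISO year + zero-padded ISO week
iso_label = iso["year"].astype(str) + "-" + iso["week"].astype(str).str.zfill(2)

print(pd.DataFrame({"iso_week": iso_week, "year-week": sunday_label, "iso_label": iso_label}))
```

For the second timestamp, the ISO week is 52 while the `%U` label reads 2014-51, which matches the mismatch visible in the dataframe output. Whichever convention is chosen, using the same one for both columns keeps weekly aggregations consistent.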


#### 4.1.4 Viewing the dataframe

In [7]:
df_eng

Unnamed: 0,id,conversation_id,created_at,lang,author_id,edit_history_tweet_ids,possibly_sensitive,text,reply_settings,entities.mentions,...,attachments.poll_ids,withheld.copyright,withheld.country_codes,withheld.scope,date,week,month,year,year-week,year-month
0,722876040876580864,722876040876580864,2016-04-20 19:54:13+00:00,en,339833759,['722876040876580864'],False,UNHCR - Survivors report massive loss of life ...,everyone,"[{'start': 110, 'end': 119, 'username': 'Refug...",...,,,,,2016-04-20 00:00:00+00:00,16,4,2016,2016-16,2016-04
1,722870567636828160,722717417617649665,2016-04-20 19:32:28+00:00,en,635283767,['722870567636828160'],False,@nebbia451 you think those Taiwanese hiking cl...,everyone,"[{'start': 0, 'end': 10, 'username': 'nebbia45...",...,,,,,2016-04-20 00:00:00+00:00,16,4,2016,2016-16,2016-04
2,722802163114733572,722802163114733572,2016-04-20 15:00:39+00:00,en,412624794,['722802163114733572'],False,Syrian artists are painting bright murals in t...,everyone,"[{'start': 114, 'end': 123, 'username': 'masha...",...,,,,,2016-04-20 00:00:00+00:00,16,4,2016,2016-16,2016-04
3,722752663130152960,722752663130152960,2016-04-20 11:43:57+00:00,en,146389787,['722752663130152960'],False,Stunned that 60% of care givers in Germany hav...,everyone,,...,,,,,2016-04-20 00:00:00+00:00,16,4,2016,2016-16,2016-04
4,722712144354144256,722712144354144256,2016-04-20 09:02:57+00:00,en,2863752725,['722712144354144256'],False,Paul Guest is an excellent moderator at the co...,everyone,,...,,,,,2016-04-20 00:00:00+00:00,16,4,2016,2016-16,2016-04
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7371,546823194590449664,546823194590449664,2014-12-22 00:23:03+00:00,en,186899860,['546823194590449664'],False,"DE-News : Berlin, a Russian immigrant, handpic...",everyone,,...,,,,,2014-12-22 00:00:00+00:00,52,12,2014,2014-51,2014-12
7372,546823190874308608,546823190874308608,2014-12-22 00:23:02+00:00,en,186899860,['546823190874308608'],False,"DE-News : Sen. Marco Rubio, R-Fla., the son of...",everyone,,...,,,,,2014-12-22 00:00:00+00:00,52,12,2014,2014-51,2014-12
7373,546699388316184577,546699388316184577,2014-12-21 16:11:06+00:00,en,186899860,['546699388316184577'],False,DE-News : There is little to break the monoton...,everyone,,...,,,,,2014-12-21 00:00:00+00:00,51,12,2014,2014-51,2014-12
7374,546544368169918464,546544368169918464,2014-12-21 05:55:06+00:00,en,186899860,['546544368169918464'],False,DE-News : View of a vacant lot earmarked for r...,everyone,,...,,,,,2014-12-21 00:00:00+00:00,51,12,2014,2014-51,2014-12


### 4.2. German-language dataset
#### 4.2.1 Loading the data

In [23]:
df_de = pd.read_csv(CASS_thesis / "01b_Data-Collection_Syrian-de.csv")

#### 4.2.2 Viewing the dataframe

In [9]:
df_de

Unnamed: 0,referenced_tweets,id,lang,created_at,conversation_id,edit_history_tweet_ids,author_id,in_reply_to_user_id,text,possibly_sensitive,...,public_metrics.like_count,public_metrics.quote_count,public_metrics.impression_count,entities.urls,geo.coordinates.type,geo.coordinates.coordinates,attachments.media_keys,context_annotations,attachments.poll_ids,entities.cashtags
0,"[{'type': 'replied_to', 'id': '722923589457068...",722923969897037824,de,2016-04-20T23:04:40.000Z,722921383605506048,['722923969897037824'],14526045,41482148,"@FrauWeh Film gesehen und nur gestaunt. Wir, a...",False,...,0.0,0.0,0.0,,,,,,,
1,,722921572810366977,de,2016-04-20T22:55:08.000Z,722921572810366977,['722921572810366977'],4122038069,,"Syrisch-orthodoxer Bischof: ""Was im Nahen Oste...",False,...,0.0,0.0,0.0,"[{'start': 92, 'end': 115, 'url': 'https://t.c...",,,,,,
2,,722899547039473665,de,2016-04-20T21:27:37.000Z,722899547039473665,['722899547039473665'],1179543852,,"Habe schon lang nicht gehört, daß Flüchtling G...",False,...,0.0,0.0,0.0,,Point,"[7.1468836, 50.7306348]",,,,
3,,722897370313195521,de,2016-04-20T21:18:58.000Z,722897370313195521,['722897370313195521'],224607633,,"""Es kommen kaum noch Flüchtlinge nach Griechen...",False,...,0.0,0.0,0.0,,,,,,,
4,"[{'type': 'quoted', 'id': '722860149807788032'}]",722891791771373573,de,2016-04-20T20:56:48.000Z,722891791771373573,['722891791771373573'],3022904603,,"Verständlich, aber #Frankreich muss eigene Feh...",False,...,0.0,0.0,0.0,"[{'start': 117, 'end': 140, 'url': 'https://t....",,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
22414,,546652160600317952,de,2014-12-21T13:03:26.000Z,546652160600317952,['546652160600317952'],40870544,,"'Nichts gegen Flüchtlinge, aber ein Gefängnis ...",False,...,0.0,0.0,0.0,"[{'start': 75, 'end': 97, 'url': 'http://t.co/...",Point,"[13.4140765, 52.4883914]",,,,
22415,,546650806565744640,de,2014-12-21T12:58:03.000Z,546650806565744640,['546650806565744640'],390185665,,Man will christliche Werte wie Nächstenliebe d...,False,...,3.0,0.0,0.0,,Point,"[13.9980716, 51.604276]",,,,
22416,"[{'type': 'replied_to', 'id': '546637849320505...",546638863075405824,de,2014-12-21T12:10:35.000Z,546637023936016384,['546638863075405824'],1426103292,1960792351,@MartinSoechting denn diese asoziale Bagage ha...,False,...,0.0,0.0,0.0,,,,,,,
22417,,546608780914733056,de,2014-12-21T10:11:03.000Z,546608780914733056,['546608780914733056'],16484211,,"""@SZ: Pkw-Maut für Ausländer: EU-Kommissionspr...",False,...,0.0,0.0,0.0,"[{'start': 114, 'end': 136, 'url': 'http://t.c...",Point,"[6.6679867, 51.2238576]",,,,


#### 4.2.3 Formatting dates 
**NOTE**: Previous attempts to convert the created_at column to datetime format failed. Therefore, additional alternative steps were taken and are shown below.

In [12]:
# showing the first entry of the German dataset as an example

df_de.iloc[0]

referenced_tweets                  [{'type': 'replied_to', 'id': '722923589457068...
id                                                                722923969897037824
lang                                                                              de
created_at                                                  2016-04-20T23:04:40.000Z
conversation_id                                                   722921383605506048
edit_history_tweet_ids                                        ['722923969897037824']
author_id                                                                   14526045
in_reply_to_user_id                                                         41482148
text                               @FrauWeh Film gesehen und nur gestaunt. Wir, a...
possibly_sensitive                                                             False
reply_settings                                                              everyone
edit_controls.edits_remaining                                    

In [13]:
# showing the first entry of the English dataset as an example

df_eng.iloc[0]

id                                                                722876040876580864
conversation_id                                                   722876040876580864
created_at                                                 2016-04-20 19:54:13+00:00
lang                                                                              en
author_id                                                                  339833759
edit_history_tweet_ids                                        ['722876040876580864']
possibly_sensitive                                                             False
text                               UNHCR - Survivors report massive loss of life ...
reply_settings                                                              everyone
entities.mentions                  [{'start': 110, 'end': 119, 'username': 'Refug...
entities.urls                      [{'start': 82, 'end': 105, 'url': 'https://t.c...
entities.annotations               [{'start': 0, 'end': 4, 'proba

The value in the created_at column of the German dataset is:
2016-04-20T23:04:40.000Z

The value in the created_at column of the English dataset is:
2016-04-20 19:54:13+00:00

The former will be adjusted to resemble the latter.

Steps:
* Replace "T" with " "
* Replace ".000Z" with "+00:00"

In [24]:
# replacing the 'T' with ' '

df_de.created_at = df_de.created_at.replace('T', ' ', regex=True)

In [25]:
# replacing the trailing '.000Z' with '+00:00'
# (the dot is escaped so that it matches a literal '.' rather than any character)

df_de.created_at = df_de.created_at.replace(r'\.000Z$', '+00:00', regex=True)
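
In a regular expression, an unescaped dot matches any character, so plain-text substitutions are often safer when no pattern matching is needed. As a small illustration of the two replacement steps, the sketch below applies them as literal replacements with `regex=False` to two values from the German dataset. It assumes every value is a string; missing values would simply stay missing.

```python
import pandas as pd

# two created_at values copied from the German-language Syrian dataset
raw = pd.Series(["2016-04-20T23:04:40.000Z", "2014-12-21T10:11:03.000Z"])

# literal (non-regex) replacements: 'T' -> ' ' and '.000Z' -> '+00:00'
cleaned = (
    raw.str.replace("T", " ", regex=False)
       .str.replace(".000Z", "+00:00", regex=False)
)

print(cleaned.tolist())
```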

In [26]:
# checking the changes

df_de

Unnamed: 0,referenced_tweets,id,lang,created_at,conversation_id,edit_history_tweet_ids,author_id,in_reply_to_user_id,text,possibly_sensitive,...,public_metrics.like_count,public_metrics.quote_count,public_metrics.impression_count,entities.urls,geo.coordinates.type,geo.coordinates.coordinates,attachments.media_keys,context_annotations,attachments.poll_ids,entities.cashtags
0,"[{'type': 'replied_to', 'id': '722923589457068...",722923969897037824,de,2016-04-20 23:04:40+00:00,722921383605506048,['722923969897037824'],14526045,41482148,"@FrauWeh Film gesehen und nur gestaunt. Wir, a...",False,...,0.0,0.0,0.0,,,,,,,
1,,722921572810366977,de,2016-04-20 22:55:08+00:00,722921572810366977,['722921572810366977'],4122038069,,"Syrisch-orthodoxer Bischof: ""Was im Nahen Oste...",False,...,0.0,0.0,0.0,"[{'start': 92, 'end': 115, 'url': 'https://t.c...",,,,,,
2,,722899547039473665,de,2016-04-20 21:27:37+00:00,722899547039473665,['722899547039473665'],1179543852,,"Habe schon lang nicht gehört, daß Flüchtling G...",False,...,0.0,0.0,0.0,,Point,"[7.1468836, 50.7306348]",,,,
3,,722897370313195521,de,2016-04-20 21:18:58+00:00,722897370313195521,['722897370313195521'],224607633,,"""Es kommen kaum noch Flüchtlinge nach Griechen...",False,...,0.0,0.0,0.0,,,,,,,
4,"[{'type': 'quoted', 'id': '722860149807788032'}]",722891791771373573,de,2016-04-20 20:56:48+00:00,722891791771373573,['722891791771373573'],3022904603,,"Verständlich, aber #Frankreich muss eigene Feh...",False,...,0.0,0.0,0.0,"[{'start': 117, 'end': 140, 'url': 'https://t....",,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
22414,,546652160600317952,de,2014-12-21 13:03:26+00:00,546652160600317952,['546652160600317952'],40870544,,"'Nichts gegen Flüchtlinge, aber ein Gefängnis ...",False,...,0.0,0.0,0.0,"[{'start': 75, 'end': 97, 'url': 'http://t.co/...",Point,"[13.4140765, 52.4883914]",,,,
22415,,546650806565744640,de,2014-12-21 12:58:03+00:00,546650806565744640,['546650806565744640'],390185665,,Man will christliche Werte wie Nächstenliebe d...,False,...,3.0,0.0,0.0,,Point,"[13.9980716, 51.604276]",,,,
22416,"[{'type': 'replied_to', 'id': '546637849320505...",546638863075405824,de,2014-12-21 12:10:35+00:00,546637023936016384,['546638863075405824'],1426103292,1960792351,@MartinSoechting denn diese asoziale Bagage ha...,False,...,0.0,0.0,0.0,,,,,,,
22417,,546608780914733056,de,2014-12-21 10:11:03+00:00,546608780914733056,['546608780914733056'],16484211,,"""@SZ: Pkw-Maut für Ausländer: EU-Kommissionspr...",False,...,0.0,0.0,0.0,"[{'start': 114, 'end': 136, 'url': 'http://t.c...",Point,"[6.6679867, 51.2238576]",,,,


Even after these replacements, attempts to convert the created_at column to datetime format still raised errors. Therefore, a new column called new_created_at will be created from created_at by keeping only the part before the "+", and this new column will be converted to datetime format instead.

In [27]:
# splitting at the '+'

split = df_de["created_at"].str.split("+", n=1, expand=True)

# making a new column of created_at before the split

df_de["new_created_at"] = split[0]

In [29]:
# converting new_created_at to datetime

df_de["new_created_at"] = pd.to_datetime(df_de["new_created_at"], errors='coerce', format='%Y-%m-%d %H:%M:%S')
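
The Notebook does not record why the earlier conversions failed. One possible cause (an assumption, not something confirmed by the source) is that a few rows contain values that are not valid timestamps; the float values in the week, month, and year columns shown in 4.2.4 would be consistent with some entries having been coerced to missing values. The sketch below shows one way to locate such rows on made-up sample data.

```python
import pandas as pd

# made-up sample: one valid timestamp, one malformed value, one missing value
raw = pd.Series(["2016-04-20T23:04:40.000Z", "not a timestamp", None])

# errors='coerce' turns anything unparseable into NaT instead of raising
parsed = pd.to_datetime(raw, errors="coerce", utc=True)

# rows that had a value but could not be parsed
unparseable = raw[parsed.isna() & raw.notna()]

print(unparseable)
```

Inspecting `unparseable` on the real column would show whether the string manipulation above is needed at all, or whether only a handful of rows need to be corrected or dropped.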

In [30]:
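# creating week, month, year, year-week, and year-month columns
# note: .dt.week is deprecated and was removed in pandas 2.0; .dt.isocalendar().week gives the same ISO week number
# note: '%U' counts weeks starting on Sunday, so 'year-week' can differ from the ISO 'week' column

df_de['week'] = df_de['new_created_at'].dt.isocalendar().week
df_de['month'] = df_de['new_created_at'].dt.month
df_de['year'] = df_de['new_created_at'].dt.year
df_de['year-week'] = df_de['new_created_at'].dt.strftime('%Y-%U')
df_de['year-month'] = df_de['new_created_at'].dt.strftime('%Y-%m')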


#### 4.2.4 Viewing the dataframe

In [31]:
df_de

Unnamed: 0,referenced_tweets,id,lang,created_at,conversation_id,edit_history_tweet_ids,author_id,in_reply_to_user_id,text,possibly_sensitive,...,attachments.media_keys,context_annotations,attachments.poll_ids,entities.cashtags,new_created_at,week,month,year,year-week,year-month
0,"[{'type': 'replied_to', 'id': '722923589457068...",722923969897037824,de,2016-04-20 23:04:40+00:00,722921383605506048,['722923969897037824'],14526045,41482148,"@FrauWeh Film gesehen und nur gestaunt. Wir, a...",False,...,,,,,2016-04-20 23:04:40,16.0,4.0,2016.0,2016-16,2016-04
1,,722921572810366977,de,2016-04-20 22:55:08+00:00,722921572810366977,['722921572810366977'],4122038069,,"Syrisch-orthodoxer Bischof: ""Was im Nahen Oste...",False,...,,,,,2016-04-20 22:55:08,16.0,4.0,2016.0,2016-16,2016-04
2,,722899547039473665,de,2016-04-20 21:27:37+00:00,722899547039473665,['722899547039473665'],1179543852,,"Habe schon lang nicht gehört, daß Flüchtling G...",False,...,,,,,2016-04-20 21:27:37,16.0,4.0,2016.0,2016-16,2016-04
3,,722897370313195521,de,2016-04-20 21:18:58+00:00,722897370313195521,['722897370313195521'],224607633,,"""Es kommen kaum noch Flüchtlinge nach Griechen...",False,...,,,,,2016-04-20 21:18:58,16.0,4.0,2016.0,2016-16,2016-04
4,"[{'type': 'quoted', 'id': '722860149807788032'}]",722891791771373573,de,2016-04-20 20:56:48+00:00,722891791771373573,['722891791771373573'],3022904603,,"Verständlich, aber #Frankreich muss eigene Feh...",False,...,,,,,2016-04-20 20:56:48,16.0,4.0,2016.0,2016-16,2016-04
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
22414,,546652160600317952,de,2014-12-21 13:03:26+00:00,546652160600317952,['546652160600317952'],40870544,,"'Nichts gegen Flüchtlinge, aber ein Gefängnis ...",False,...,,,,,2014-12-21 13:03:26,51.0,12.0,2014.0,2014-51,2014-12
22415,,546650806565744640,de,2014-12-21 12:58:03+00:00,546650806565744640,['546650806565744640'],390185665,,Man will christliche Werte wie Nächstenliebe d...,False,...,,,,,2014-12-21 12:58:03,51.0,12.0,2014.0,2014-51,2014-12
22416,"[{'type': 'replied_to', 'id': '546637849320505...",546638863075405824,de,2014-12-21 12:10:35+00:00,546637023936016384,['546638863075405824'],1426103292,1960792351,@MartinSoechting denn diese asoziale Bagage ha...,False,...,,,,,2014-12-21 12:10:35,51.0,12.0,2014.0,2014-51,2014-12
22417,,546608780914733056,de,2014-12-21 10:11:03+00:00,546608780914733056,['546608780914733056'],16484211,,"""@SZ: Pkw-Maut für Ausländer: EU-Kommissionspr...",False,...,,,,,2014-12-21 10:11:03,51.0,12.0,2014.0,2014-51,2014-12


### 4.3 Merging the English-language and German-language dataframes

#### 4.3.1 Creating language column

In [32]:
df_de['Language'] = 'German'
df_eng['Language'] = 'English'

#### 4.3.2 Merging

In [33]:
# Creating a joint data frame

df_syr = pd.concat([df_de, df_eng], ignore_index=True)

#### 4.3.3 Viewing the dataframe

In [34]:
df_syr

Unnamed: 0,referenced_tweets,id,lang,created_at,conversation_id,edit_history_tweet_ids,author_id,in_reply_to_user_id,text,possibly_sensitive,...,month,year,year-week,year-month,Language,entities.annotations,withheld.copyright,withheld.country_codes,withheld.scope,date
0,"[{'type': 'replied_to', 'id': '722923589457068...",722923969897037824,de,2016-04-20 23:04:40+00:00,722921383605506048,['722923969897037824'],14526045,41482148,"@FrauWeh Film gesehen und nur gestaunt. Wir, a...",False,...,4.0,2016.0,2016-16,2016-04,German,,,,,NaT
1,,722921572810366977,de,2016-04-20 22:55:08+00:00,722921572810366977,['722921572810366977'],4122038069,,"Syrisch-orthodoxer Bischof: ""Was im Nahen Oste...",False,...,4.0,2016.0,2016-16,2016-04,German,,,,,NaT
2,,722899547039473665,de,2016-04-20 21:27:37+00:00,722899547039473665,['722899547039473665'],1179543852,,"Habe schon lang nicht gehört, daß Flüchtling G...",False,...,4.0,2016.0,2016-16,2016-04,German,,,,,NaT
3,,722897370313195521,de,2016-04-20 21:18:58+00:00,722897370313195521,['722897370313195521'],224607633,,"""Es kommen kaum noch Flüchtlinge nach Griechen...",False,...,4.0,2016.0,2016-16,2016-04,German,,,,,NaT
4,"[{'type': 'quoted', 'id': '722860149807788032'}]",722891791771373573,de,2016-04-20 20:56:48+00:00,722891791771373573,['722891791771373573'],3022904603,,"Verständlich, aber #Frankreich muss eigene Feh...",False,...,4.0,2016.0,2016-16,2016-04,German,,,,,NaT
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29790,,546823194590449664,en,2014-12-22 00:23:03+00:00,546823194590449664,['546823194590449664'],186899860,,"DE-News : Berlin, a Russian immigrant, handpic...",False,...,12.0,2014.0,2014-51,2014-12,English,"[{'start': 10, 'end': 15, 'probability': 0.630...",,,,2014-12-22 00:00:00+00:00
29791,,546823190874308608,en,2014-12-22 00:23:02+00:00,546823190874308608,['546823190874308608'],186899860,,"DE-News : Sen. Marco Rubio, R-Fla., the son of...",False,...,12.0,2014.0,2014-51,2014-12,English,"[{'start': 15, 'end': 25, 'probability': 0.928...",,,,2014-12-22 00:00:00+00:00
29792,,546699388316184577,en,2014-12-21 16:11:06+00:00,546699388316184577,['546699388316184577'],186899860,,DE-News : There is little to break the monoton...,False,...,12.0,2014.0,2014-51,2014-12,English,"[{'start': 3, 'end': 6, 'probability': 0.4604,...",,,,2014-12-21 00:00:00+00:00
29793,,546544368169918464,en,2014-12-21 05:55:06+00:00,546544368169918464,['546544368169918464'],186899860,,DE-News : View of a vacant lot earmarked for r...,False,...,12.0,2014.0,2014-51,2014-12,English,"[{'start': 57, 'end': 62, 'probability': 0.848...",,,,2014-12-21 00:00:00+00:00
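
In the merged output, the date column is NaT for the German rows because a 'date' column was only created for the English dataframe; `pd.concat` fills columns missing from one input with missing values. The sketch below reproduces this on two made-up single-row frames and shows one way the column could be derived for the German data before merging. It assumes the German timestamps are in UTC, as the removed "+00:00" suffix suggests.

```python
import pandas as pd

# made-up single-row stand-ins for df_de and df_eng
de = pd.DataFrame({"new_created_at": pd.to_datetime(["2016-04-20 23:04:40"])})
eng = pd.DataFrame({"created_at": pd.to_datetime(["2016-04-20T19:54:13.000Z"])})
eng["date"] = eng["created_at"].dt.normalize()

# 'date' exists only in eng, so the German row gets NaT after concatenation
merged = pd.concat([de, eng], ignore_index=True)

# assumed fix: mark the naive German timestamps as UTC, then normalize to midnight
de["date"] = de["new_created_at"].dt.tz_localize("UTC").dt.normalize()
merged_fixed = pd.concat([de, eng], ignore_index=True)

print(merged["date"].isna().tolist())
print(merged_fixed["date"].isna().tolist())
```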


## 5. Ukrainian inflow datasets
### 5.1. English-language dataset
#### 5.1.1 Loading the data

In [35]:
df_eng = pd.read_csv(CASS_thesis / "01a_Data-Collection_Ukrainian-eng.csv")

#### 5.1.2 Viewing the dataframe

In [36]:
df_eng

Unnamed: 0,id,possibly_sensitive,reply_settings,author_id,edit_history_tweet_ids,created_at,conversation_id,lang,text,edit_controls.edits_remaining,...,in_reply_to_user_id,entities.mentions,context_annotations,entities.urls,entities.hashtags,attachments.media_keys,geo.coordinates.type,geo.coordinates.coordinates,entities.cashtags,attachments.poll_ids
0,1584273161612578816,False,everyone,394007113,['1584273161612578816'],2022-10-23T19:58:50.000Z,1584273161612578816,en,German police got on our train to do an immigr...,5,...,,,,,,,,,,
1,1584219793670176769,False,everyone,1272454735,['1584219793670176769'],2022-10-23T16:26:46.000Z,1584148943243399168,en,@thesiriusreport The Ukrainians now boast of b...,5,...,7.017749e+17,"[{'start': 0, 'end': 16, 'username': 'thesiriu...",,,,,,,,
2,1584191961053159425,False,everyone,1121807798826930177,['1584191961053159425'],2022-10-23T14:36:11.000Z,1584191303843139584,en,@tom_username_ DPR/LNR militia did a huge part...,5,...,8.725516e+17,"[{'start': 0, 'end': 14, 'username': 'tom_user...",,,,,,,,
3,1584136624413507585,False,everyone,1070626707319767040,['1584136624413507585'],2022-10-23T10:56:17.000Z,1584136624413507585,en,"Syrians, Iraqis, Lebanese, Afghans, Yemenis, P...",5,...,,,"[{'domain': {'id': '131', 'name': 'Unified Twi...","[{'start': 287, 'end': 310, 'url': 'https://t....","[{'start': 271, 'end': 286, 'tag': 'IranRevoIu...",['3_1584136619128397824'],,,,
4,1583898887160442880,False,everyone,952890643248025600,['1583898887160442880'],2022-10-22T19:11:36.000Z,1583862017496330242,en,@grandmaster_pip What Ukrainians tried to hija...,5,...,2.177552e+09,"[{'start': 0, 'end': 16, 'username': 'grandmas...",,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3774,1408372919995146240,False,everyone,232958476,['1408372919995146240'],2021-06-25T10:34:05.000Z,1408372919995146240,en,"The ministry of immigration , runs the biggest...",5,...,,,,,,,,,,
3775,1408145312926126088,False,everyone,9474872,['1408145312926126088'],2021-06-24T19:29:39.000Z,1407248240156815360,en,"@sudo_f @typo3 @felicity_brand Intellectually,...",5,...,9.474872e+06,"[{'start': 0, 'end': 7, 'username': 'sudo_f', ...","[{'domain': {'id': '30', 'name': 'Entities [En...",,,,,,,
3776,1408131214633058311,False,everyone,980714168,['1408131214633058311'],2021-06-24T18:33:38.000Z,1407750900530171907,en,@Waringphilip Agree. Immigration has done me p...,5,...,2.199679e+09,"[{'start': 0, 'end': 13, 'username': 'Waringph...",,,,,,,,
3777,1408021231664762881,False,everyone,185889479,['1408021231664762881'],2021-06-24T11:16:36.000Z,1407754464145018882,en,@rakyll I would love to have automatic cross z...,5,...,1.080941e+07,"[{'start': 0, 'end': 7, 'username': 'rakyll', ...",,,,,,,,


#### 5.1.3 Formatting dates 

In [37]:
# converting created_at to datetime format

df_eng["created_at"] = pd.to_datetime(df_eng["created_at"])

In [38]:
# converting it to date and creating a new column called "date"

df_eng['date'] = df_eng['created_at'].dt.normalize()

In [39]:
# creating week, month, year, year-week, and year-month columns
# note: .dt.week is deprecated and was removed in pandas 2.0; .dt.isocalendar().week gives the same ISO week number
# note: '%U' counts weeks starting on Sunday, so 'year-week' can differ from the ISO 'week' column

df_eng['week'] = df_eng['created_at'].dt.isocalendar().week
df_eng['month'] = df_eng['created_at'].dt.month
df_eng['year'] = df_eng['created_at'].dt.year
df_eng['year-week'] = df_eng['created_at'].dt.strftime('%Y-%U')
df_eng['year-month'] = df_eng['created_at'].dt.strftime('%Y-%m')


#### 5.1.4 Viewing the dataframe

In [40]:
df_eng

Unnamed: 0,id,possibly_sensitive,reply_settings,author_id,edit_history_tweet_ids,created_at,conversation_id,lang,text,edit_controls.edits_remaining,...,geo.coordinates.type,geo.coordinates.coordinates,entities.cashtags,attachments.poll_ids,date,week,month,year,year-week,year-month
0,1584273161612578816,False,everyone,394007113,['1584273161612578816'],2022-10-23 19:58:50+00:00,1584273161612578816,en,German police got on our train to do an immigr...,5,...,,,,,2022-10-23 00:00:00+00:00,42,10,2022,2022-43,2022-10
1,1584219793670176769,False,everyone,1272454735,['1584219793670176769'],2022-10-23 16:26:46+00:00,1584148943243399168,en,@thesiriusreport The Ukrainians now boast of b...,5,...,,,,,2022-10-23 00:00:00+00:00,42,10,2022,2022-43,2022-10
2,1584191961053159425,False,everyone,1121807798826930177,['1584191961053159425'],2022-10-23 14:36:11+00:00,1584191303843139584,en,@tom_username_ DPR/LNR militia did a huge part...,5,...,,,,,2022-10-23 00:00:00+00:00,42,10,2022,2022-43,2022-10
3,1584136624413507585,False,everyone,1070626707319767040,['1584136624413507585'],2022-10-23 10:56:17+00:00,1584136624413507585,en,"Syrians, Iraqis, Lebanese, Afghans, Yemenis, P...",5,...,,,,,2022-10-23 00:00:00+00:00,42,10,2022,2022-43,2022-10
4,1583898887160442880,False,everyone,952890643248025600,['1583898887160442880'],2022-10-22 19:11:36+00:00,1583862017496330242,en,@grandmaster_pip What Ukrainians tried to hija...,5,...,,,,,2022-10-22 00:00:00+00:00,42,10,2022,2022-42,2022-10
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3774,1408372919995146240,False,everyone,232958476,['1408372919995146240'],2021-06-25 10:34:05+00:00,1408372919995146240,en,"The ministry of immigration , runs the biggest...",5,...,,,,,2021-06-25 00:00:00+00:00,25,6,2021,2021-25,2021-06
3775,1408145312926126088,False,everyone,9474872,['1408145312926126088'],2021-06-24 19:29:39+00:00,1407248240156815360,en,"@sudo_f @typo3 @felicity_brand Intellectually,...",5,...,,,,,2021-06-24 00:00:00+00:00,25,6,2021,2021-25,2021-06
3776,1408131214633058311,False,everyone,980714168,['1408131214633058311'],2021-06-24 18:33:38+00:00,1407750900530171907,en,@Waringphilip Agree. Immigration has done me p...,5,...,,,,,2021-06-24 00:00:00+00:00,25,6,2021,2021-25,2021-06
3777,1408021231664762881,False,everyone,185889479,['1408021231664762881'],2021-06-24 11:16:36+00:00,1407754464145018882,en,@rakyll I would love to have automatic cross z...,5,...,,,,,2021-06-24 00:00:00+00:00,25,6,2021,2021-25,2021-06


### 5.2. German-language dataset
#### 5.2.1 Loading the data

In [44]:
df_de = pd.read_csv(CASS_thesis / "01b_Data-Collection_Ukrainian-de.csv")



#### 5.2.2 Viewing the dataframe

In [45]:
df_de

Unnamed: 0,conversation_id,id,lang,author_id,text,possibly_sensitive,context_annotations,edit_history_tweet_ids,reply_settings,created_at,...,entities.mentions,geo.coordinates.type,geo.coordinates.coordinates,entities.urls,entities.hashtags,attachments.media_keys,withheld.copyright,withheld.country_codes,withheld.scope,attachments.poll_ids
0,1584313222764511232,1584313222764511232,de,1508097355458961410,Die Ukraine plant eine „False Flag“ Operation ...,False,"[{'domain': {'id': '46', 'name': 'Business Tax...",['1584313222764511232'],everyone,2022-10-23T22:38:02.000Z,...,,,,,,,,,,
1,1584091067837992962,1584313179298885633,de,1498603032640167936,"@MalteKaufmann Oh wow..Ja, schlimme Zustände w...",False,,['1584313179298885633'],everyone,2022-10-23T22:37:51.000Z,...,"[{'start': 0, 'end': 14, 'username': 'MalteKau...",,,,,,,,,
2,1584264046454636544,1584264046454636544,de,16301812,"Im heutigen Video geht es um die Frage, ob man...",False,,['1584264046454636544'],everyone,2022-10-23T19:22:37.000Z,...,,Point,"[8.35736, 49.85121]","[{'start': 212, 'end': 235, 'url': 'https://t....","[{'start': 135, 'end': 143, 'tag': 'denmark'},...",,,,,
3,1583887623806160898,1584258196608192512,de,1310099812474331147,"@HasnainKazim Ach, es ist ja so einfach mit de...",False,,['1584258196608192512'],everyone,2022-10-23T18:59:22.000Z,...,"[{'start': 0, 'end': 13, 'username': 'HasnainK...",,,,,,,,,
4,1584253429714944000,1584253429714944000,de,2218012226,"Hätte er auch gewonnen, wenn er kein Ukrainer ...",False,,['1584253429714944000'],everyone,2022-10-23T18:40:26.000Z,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
34743,1408289579115978758,1408289579115978758,de,3021093443,"4.700.000.000,- Euro\nfür Syrer. \nEU unterstü...",False,,['1408289579115978758'],everyone,2021-06-25T05:02:55.000Z,...,,,,,,,,,,
34744,1408174669375676419,1408182043612266497,de,1190726647231766530,@PhilipPlickert Ist das nicht rassistisch mit ...,False,,['1408182043612266497'],everyone,2021-06-24T21:55:36.000Z,...,"[{'start': 0, 'end': 15, 'username': 'PhilipPl...",,,,,,,,,
34745,1408140630149173255,1408140630149173255,de,10456882,Menschen hetzen gegen #LGBTQI und Flüchtlinge ...,False,,['1408140630149173255'],everyone,2021-06-24T19:11:02.000Z,...,,,,,"[{'start': 22, 'end': 29, 'tag': 'LGBTQI'}]",,,,,
34746,1408121056724926467,1408121056724926467,de,3310937109,Gemeinsamer Appell des Münchner Stadtrats und ...,False,,['1408121056724926467'],everyone,2021-06-24T17:53:16.000Z,...,"[{'start': 194, 'end': 197, 'username': 'SZ', ...",,,"[{'start': 212, 'end': 235, 'url': 'https://t....","[{'start': 133, 'end': 145, 'tag': 'Zuwanderun...",,,,,


#### 5.2.3 Formatting dates 
**NOTE**: Converting created_at to datetime format initially failed because one observation has the value "6e100b0c8dc4fa7e" for created_at, the result of an error during collection. This observation is located, viewed, and dropped below.

In [49]:
# locating the error

df_de.loc[df_de['created_at'] == '6e100b0c8dc4fa7e', 'created_at']

32768    6e100b0c8dc4fa7e
Name: created_at, dtype: object

The error is located at index 32768.

In [51]:
# viewing the error

df_de.iloc[32768]

conversation_id                                              https://t.co/mY57RUt49f
id                                                                             False
lang                                                                             NaN
author_id                                                    ['1465568558868480001']
text                                                                        everyone
possibly_sensitive                                          2021-11-30T06:29:08.000Z
context_annotations                                                                5
edit_history_tweet_ids                                                          True
reply_settings                                              2021-11-30T06:59:08.000Z
created_at                                                          6e100b0c8dc4fa7e
edit_controls.edits_remaining                                                    0.0
edit_controls.is_edit_eligible                                   

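Upon viewing index 32768, it is apparent that the values are shifted into the wrong columns (for example, text holds "everyone" and possibly_sensitive holds a timestamp), so the row indeed needs to be dropped.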

In [52]:
df_de = df_de.drop(df_de.index[32768])
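
The lookup above works because the bad value is already known. As a minimal sketch (not the approach used in this notebook), unparseable created_at values can also be found without knowing them in advance by coercing parse failures to `NaT`. The toy dataframe and the names `toy`, `parsed`, `bad_rows`, and `clean` are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in for df_de; the rows are invented for illustration only
toy = pd.DataFrame({
    "created_at": [
        "2022-10-23T22:38:02.000Z",
        "6e100b0c8dc4fa7e",          # malformed value, like the collection error
        "2021-06-24T17:53:16.000Z",
    ],
    "text": ["first", "misaligned row", "third"],
})

# errors="coerce" turns anything that cannot be parsed into NaT instead of raising
parsed = pd.to_datetime(toy["created_at"], errors="coerce", utc=True)

# Rows whose timestamp failed to parse
bad_rows = toy[parsed.isna()]

# Keep only the rows that parsed, and store the parsed timestamps
clean = toy[parsed.notna()].copy()
clean["created_at"] = parsed[parsed.notna()]

print(bad_rows)
```

Inspecting `bad_rows` before dropping anything keeps the manual check that the notebook performs, while catching every malformed row in one pass.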

Now that the problematic entry has been removed, date formatting can continue.

In [53]:
# converting created_at to datetime format

df_de["created_at"] = pd.to_datetime(df_de["created_at"])

In [54]:
# converting it to date and creating a new column called "date"

df_de['date'] = df_de['created_at'].dt.normalize()

In [55]:
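# creating week, month, year, year-week, and year-month columns
# (.dt.week is deprecated and was removed in pandas 2.0;
#  .dt.isocalendar().week returns the same ISO week number)

df_de['week'] = df_de['created_at'].dt.isocalendar().week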
df_de['month'] = df_de['created_at'].dt.month
df_de['year'] = df_de['created_at'].dt.year
df_de['year-week'] = df_de['created_at'].dt.strftime('%Y-%U')
df_de['year-month'] = df_de['created_at'].dt.strftime('%Y-%m')
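
Note that week and year-week follow different week conventions: week holds the ISO week number (weeks run Monday to Sunday), while `%U` counts weeks that start on Sunday. This is why a Sunday such as 2022-10-23 is shown with week 42 but year-week 2022-43. A minimal sketch of the difference, using two timestamps from the data; the variable names are illustrative, and the ISO-based label is a possible alternative, not what this notebook uses.

```python
import pandas as pd

# Two timestamps taken from the data: a Saturday and the following Sunday
ts = pd.Series(pd.to_datetime(["2022-10-22 19:11:36+00:00",
                               "2022-10-23 19:58:50+00:00"]))

# ISO calendar: weeks run Monday to Sunday
iso = ts.dt.isocalendar()
iso_week = iso["week"].tolist()

# %U: weeks run Sunday to Saturday, so the Sunday already starts a new week
sunday_week = ts.dt.strftime("%Y-%U").tolist()

# A label built from the ISO year and ISO week, consistent with the week column
iso_label = (iso["year"].astype(str) + "-"
             + iso["week"].astype(str).str.zfill(2)).tolist()

print(iso_week)
print(sunday_week)
print(iso_label)
```

Either convention is fine for weekly aggregation as long as the same one is used throughout the analysis.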



#### 5.2.4 Viewing the dataframe

In [56]:
df_de

Unnamed: 0,conversation_id,id,lang,author_id,text,possibly_sensitive,context_annotations,edit_history_tweet_ids,reply_settings,created_at,...,withheld.copyright,withheld.country_codes,withheld.scope,attachments.poll_ids,date,week,month,year,year-week,year-month
0,1584313222764511232,1584313222764511232,de,1508097355458961410,Die Ukraine plant eine „False Flag“ Operation ...,False,"[{'domain': {'id': '46', 'name': 'Business Tax...",['1584313222764511232'],everyone,2022-10-23 22:38:02+00:00,...,,,,,2022-10-23 00:00:00+00:00,42.0,10.0,2022.0,2022-43,2022-10
1,1584091067837992962,1584313179298885633,de,1498603032640167936,"@MalteKaufmann Oh wow..Ja, schlimme Zustände w...",False,,['1584313179298885633'],everyone,2022-10-23 22:37:51+00:00,...,,,,,2022-10-23 00:00:00+00:00,42.0,10.0,2022.0,2022-43,2022-10
2,1584264046454636544,1584264046454636544,de,16301812,"Im heutigen Video geht es um die Frage, ob man...",False,,['1584264046454636544'],everyone,2022-10-23 19:22:37+00:00,...,,,,,2022-10-23 00:00:00+00:00,42.0,10.0,2022.0,2022-43,2022-10
3,1583887623806160898,1584258196608192512,de,1310099812474331147,"@HasnainKazim Ach, es ist ja so einfach mit de...",False,,['1584258196608192512'],everyone,2022-10-23 18:59:22+00:00,...,,,,,2022-10-23 00:00:00+00:00,42.0,10.0,2022.0,2022-43,2022-10
4,1584253429714944000,1584253429714944000,de,2218012226,"Hätte er auch gewonnen, wenn er kein Ukrainer ...",False,,['1584253429714944000'],everyone,2022-10-23 18:40:26+00:00,...,,,,,2022-10-23 00:00:00+00:00,42.0,10.0,2022.0,2022-43,2022-10
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
34743,1408289579115978758,1408289579115978758,de,3021093443,"4.700.000.000,- Euro\nfür Syrer. \nEU unterstü...",False,,['1408289579115978758'],everyone,2021-06-25 05:02:55+00:00,...,,,,,2021-06-25 00:00:00+00:00,25.0,6.0,2021.0,2021-25,2021-06
34744,1408174669375676419,1408182043612266497,de,1190726647231766530,@PhilipPlickert Ist das nicht rassistisch mit ...,False,,['1408182043612266497'],everyone,2021-06-24 21:55:36+00:00,...,,,,,2021-06-24 00:00:00+00:00,25.0,6.0,2021.0,2021-25,2021-06
34745,1408140630149173255,1408140630149173255,de,10456882,Menschen hetzen gegen #LGBTQI und Flüchtlinge ...,False,,['1408140630149173255'],everyone,2021-06-24 19:11:02+00:00,...,,,,,2021-06-24 00:00:00+00:00,25.0,6.0,2021.0,2021-25,2021-06
34746,1408121056724926467,1408121056724926467,de,3310937109,Gemeinsamer Appell des Münchner Stadtrats und ...,False,,['1408121056724926467'],everyone,2021-06-24 17:53:16+00:00,...,,,,,2021-06-24 00:00:00+00:00,25.0,6.0,2021.0,2021-25,2021-06


### 5.3 Merging the English-language and German-language dataframes

#### 5.3.1 Creating language column

In [57]:
df_de['Language'] = 'German'
df_eng['Language'] = 'English'

#### 5.3.2 Merging

In [58]:
# Creating a joint data frame

df_uk = pd.concat([df_de, df_eng], ignore_index = True)
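
The two dataframes do not share exactly the same columns. `pd.concat` keeps the union of all columns and fills the gaps with `NaN`, which is why columns such as entities.cashtags are empty for the German rows. A small sketch with invented toy frames (`de`, `en`, and their contents are assumptions for illustration):

```python
import pandas as pd

# Toy frames with one shared column and one column unique to each
de = pd.DataFrame({"text": ["Hallo", "Welt"],
                   "withheld.scope": [None, "tweet"]})
en = pd.DataFrame({"text": ["Hello"],
                   "entities.cashtags": ["[{'tag': 'XYZ'}]"]})

# Label each frame before stacking, as in the cell above
de["Language"] = "German"
en["Language"] = "English"

# ignore_index=True renumbers the rows 0..n-1 in the combined frame
combined = pd.concat([de, en], ignore_index=True)

print(combined)
```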

#### 5.3.3 Viewing the dataframe

In [59]:
df_uk

Unnamed: 0,conversation_id,id,lang,author_id,text,possibly_sensitive,context_annotations,edit_history_tweet_ids,reply_settings,created_at,...,attachments.poll_ids,date,week,month,year,year-week,year-month,Language,entities.annotations,entities.cashtags
0,1584313222764511232,1584313222764511232,de,1508097355458961410,Die Ukraine plant eine „False Flag“ Operation ...,False,"[{'domain': {'id': '46', 'name': 'Business Tax...",['1584313222764511232'],everyone,2022-10-23 22:38:02+00:00,...,,2022-10-23 00:00:00+00:00,42.0,10.0,2022.0,2022-43,2022-10,German,,
1,1584091067837992962,1584313179298885633,de,1498603032640167936,"@MalteKaufmann Oh wow..Ja, schlimme Zustände w...",False,,['1584313179298885633'],everyone,2022-10-23 22:37:51+00:00,...,,2022-10-23 00:00:00+00:00,42.0,10.0,2022.0,2022-43,2022-10,German,,
2,1584264046454636544,1584264046454636544,de,16301812,"Im heutigen Video geht es um die Frage, ob man...",False,,['1584264046454636544'],everyone,2022-10-23 19:22:37+00:00,...,,2022-10-23 00:00:00+00:00,42.0,10.0,2022.0,2022-43,2022-10,German,,
3,1583887623806160898,1584258196608192512,de,1310099812474331147,"@HasnainKazim Ach, es ist ja so einfach mit de...",False,,['1584258196608192512'],everyone,2022-10-23 18:59:22+00:00,...,,2022-10-23 00:00:00+00:00,42.0,10.0,2022.0,2022-43,2022-10,German,,
4,1584253429714944000,1584253429714944000,de,2218012226,"Hätte er auch gewonnen, wenn er kein Ukrainer ...",False,,['1584253429714944000'],everyone,2022-10-23 18:40:26+00:00,...,,2022-10-23 00:00:00+00:00,42.0,10.0,2022.0,2022-43,2022-10,German,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
38521,1408372919995146240,1408372919995146240,en,232958476,"The ministry of immigration , runs the biggest...",False,,['1408372919995146240'],everyone,2021-06-25 10:34:05+00:00,...,,2021-06-25 00:00:00+00:00,25.0,6.0,2021.0,2021-25,2021-06,English,,
38522,1407248240156815360,1408145312926126088,en,9474872,"@sudo_f @typo3 @felicity_brand Intellectually,...",False,"[{'domain': {'id': '30', 'name': 'Entities [En...",['1408145312926126088'],everyone,2021-06-24 19:29:39+00:00,...,,2021-06-24 00:00:00+00:00,25.0,6.0,2021.0,2021-25,2021-06,English,"[{'start': 171, 'end': 185, 'probability': 0.7...",
38523,1407750900530171907,1408131214633058311,en,980714168,@Waringphilip Agree. Immigration has done me p...,False,,['1408131214633058311'],everyone,2021-06-24 18:33:38+00:00,...,,2021-06-24 00:00:00+00:00,25.0,6.0,2021.0,2021-25,2021-06,English,,
38524,1407754464145018882,1408021231664762881,en,185889479,@rakyll I would love to have automatic cross z...,False,,['1408021231664762881'],everyone,2021-06-24 11:16:36+00:00,...,,2021-06-24 00:00:00+00:00,25.0,6.0,2021.0,2021-25,2021-06,English,"[{'start': 50, 'end': 52, 'probability': 0.628...",


## 6. Combining Syrian inflow and Ukrainian inflow dataframes

### 6.1. Creating inflow column

In [60]:
# Adding a new column, inflow, indicating Syrians or Ukrainians

df_syr['inflow'] = 'Syrians'
df_uk['inflow'] = 'Ukrainians'

### 6.2. Viewing the dataframes

In [61]:
df_syr

Unnamed: 0,referenced_tweets,id,lang,created_at,conversation_id,edit_history_tweet_ids,author_id,in_reply_to_user_id,text,possibly_sensitive,...,year,year-week,year-month,Language,entities.annotations,withheld.copyright,withheld.country_codes,withheld.scope,date,inflow
0,"[{'type': 'replied_to', 'id': '722923589457068...",722923969897037824,de,2016-04-20 23:04:40+00:00,722921383605506048,['722923969897037824'],14526045,41482148,"@FrauWeh Film gesehen und nur gestaunt. Wir, a...",False,...,2016.0,2016-16,2016-04,German,,,,,NaT,Syrians
1,,722921572810366977,de,2016-04-20 22:55:08+00:00,722921572810366977,['722921572810366977'],4122038069,,"Syrisch-orthodoxer Bischof: ""Was im Nahen Oste...",False,...,2016.0,2016-16,2016-04,German,,,,,NaT,Syrians
2,,722899547039473665,de,2016-04-20 21:27:37+00:00,722899547039473665,['722899547039473665'],1179543852,,"Habe schon lang nicht gehört, daß Flüchtling G...",False,...,2016.0,2016-16,2016-04,German,,,,,NaT,Syrians
3,,722897370313195521,de,2016-04-20 21:18:58+00:00,722897370313195521,['722897370313195521'],224607633,,"""Es kommen kaum noch Flüchtlinge nach Griechen...",False,...,2016.0,2016-16,2016-04,German,,,,,NaT,Syrians
4,"[{'type': 'quoted', 'id': '722860149807788032'}]",722891791771373573,de,2016-04-20 20:56:48+00:00,722891791771373573,['722891791771373573'],3022904603,,"Verständlich, aber #Frankreich muss eigene Feh...",False,...,2016.0,2016-16,2016-04,German,,,,,NaT,Syrians
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29790,,546823194590449664,en,2014-12-22 00:23:03+00:00,546823194590449664,['546823194590449664'],186899860,,"DE-News : Berlin, a Russian immigrant, handpic...",False,...,2014.0,2014-51,2014-12,English,"[{'start': 10, 'end': 15, 'probability': 0.630...",,,,2014-12-22 00:00:00+00:00,Syrians
29791,,546823190874308608,en,2014-12-22 00:23:02+00:00,546823190874308608,['546823190874308608'],186899860,,"DE-News : Sen. Marco Rubio, R-Fla., the son of...",False,...,2014.0,2014-51,2014-12,English,"[{'start': 15, 'end': 25, 'probability': 0.928...",,,,2014-12-22 00:00:00+00:00,Syrians
29792,,546699388316184577,en,2014-12-21 16:11:06+00:00,546699388316184577,['546699388316184577'],186899860,,DE-News : There is little to break the monoton...,False,...,2014.0,2014-51,2014-12,English,"[{'start': 3, 'end': 6, 'probability': 0.4604,...",,,,2014-12-21 00:00:00+00:00,Syrians
29793,,546544368169918464,en,2014-12-21 05:55:06+00:00,546544368169918464,['546544368169918464'],186899860,,DE-News : View of a vacant lot earmarked for r...,False,...,2014.0,2014-51,2014-12,English,"[{'start': 57, 'end': 62, 'probability': 0.848...",,,,2014-12-21 00:00:00+00:00,Syrians


In [62]:
df_uk

Unnamed: 0,conversation_id,id,lang,author_id,text,possibly_sensitive,context_annotations,edit_history_tweet_ids,reply_settings,created_at,...,date,week,month,year,year-week,year-month,Language,entities.annotations,entities.cashtags,inflow
0,1584313222764511232,1584313222764511232,de,1508097355458961410,Die Ukraine plant eine „False Flag“ Operation ...,False,"[{'domain': {'id': '46', 'name': 'Business Tax...",['1584313222764511232'],everyone,2022-10-23 22:38:02+00:00,...,2022-10-23 00:00:00+00:00,42.0,10.0,2022.0,2022-43,2022-10,German,,,Ukrainians
1,1584091067837992962,1584313179298885633,de,1498603032640167936,"@MalteKaufmann Oh wow..Ja, schlimme Zustände w...",False,,['1584313179298885633'],everyone,2022-10-23 22:37:51+00:00,...,2022-10-23 00:00:00+00:00,42.0,10.0,2022.0,2022-43,2022-10,German,,,Ukrainians
2,1584264046454636544,1584264046454636544,de,16301812,"Im heutigen Video geht es um die Frage, ob man...",False,,['1584264046454636544'],everyone,2022-10-23 19:22:37+00:00,...,2022-10-23 00:00:00+00:00,42.0,10.0,2022.0,2022-43,2022-10,German,,,Ukrainians
3,1583887623806160898,1584258196608192512,de,1310099812474331147,"@HasnainKazim Ach, es ist ja so einfach mit de...",False,,['1584258196608192512'],everyone,2022-10-23 18:59:22+00:00,...,2022-10-23 00:00:00+00:00,42.0,10.0,2022.0,2022-43,2022-10,German,,,Ukrainians
4,1584253429714944000,1584253429714944000,de,2218012226,"Hätte er auch gewonnen, wenn er kein Ukrainer ...",False,,['1584253429714944000'],everyone,2022-10-23 18:40:26+00:00,...,2022-10-23 00:00:00+00:00,42.0,10.0,2022.0,2022-43,2022-10,German,,,Ukrainians
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
38521,1408372919995146240,1408372919995146240,en,232958476,"The ministry of immigration , runs the biggest...",False,,['1408372919995146240'],everyone,2021-06-25 10:34:05+00:00,...,2021-06-25 00:00:00+00:00,25.0,6.0,2021.0,2021-25,2021-06,English,,,Ukrainians
38522,1407248240156815360,1408145312926126088,en,9474872,"@sudo_f @typo3 @felicity_brand Intellectually,...",False,"[{'domain': {'id': '30', 'name': 'Entities [En...",['1408145312926126088'],everyone,2021-06-24 19:29:39+00:00,...,2021-06-24 00:00:00+00:00,25.0,6.0,2021.0,2021-25,2021-06,English,"[{'start': 171, 'end': 185, 'probability': 0.7...",,Ukrainians
38523,1407750900530171907,1408131214633058311,en,980714168,@Waringphilip Agree. Immigration has done me p...,False,,['1408131214633058311'],everyone,2021-06-24 18:33:38+00:00,...,2021-06-24 00:00:00+00:00,25.0,6.0,2021.0,2021-25,2021-06,English,,,Ukrainians
38524,1407754464145018882,1408021231664762881,en,185889479,@rakyll I would love to have automatic cross z...,False,,['1408021231664762881'],everyone,2021-06-24 11:16:36+00:00,...,2021-06-24 00:00:00+00:00,25.0,6.0,2021.0,2021-25,2021-06,English,"[{'start': 50, 'end': 52, 'probability': 0.628...",,Ukrainians


### 6.3. Merging 

In [63]:
# Creating a joint data frame

df = pd.concat([df_syr, df_uk], ignore_index = True)
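
After merging, it can be worth confirming that no rows were lost and that every row carries both labels. A rough sketch of such a check on toy data; the frames `syr` and `ukr` and their counts are invented and only stand in for df_syr and df_uk.

```python
import pandas as pd

# Toy stand-ins for the two inflow dataframes
syr = pd.DataFrame({"text": ["a", "b", "c"],
                    "Language": ["German", "German", "English"]})
ukr = pd.DataFrame({"text": ["d", "e"],
                    "Language": ["German", "English"]})
syr["inflow"] = "Syrians"
ukr["inflow"] = "Ukrainians"

merged = pd.concat([syr, ukr], ignore_index=True)

# The merged row count should equal the sum of the two inputs
rows_preserved = len(merged) == len(syr) + len(ukr)

# Number of tweets per inflow and language
counts = pd.crosstab(merged["inflow"], merged["Language"])

print(rows_preserved)
print(counts)
```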

### 6.4. Viewing the dataframe

In [64]:
df

Unnamed: 0,referenced_tweets,id,lang,created_at,conversation_id,edit_history_tweet_ids,author_id,in_reply_to_user_id,text,possibly_sensitive,...,year,year-week,year-month,Language,entities.annotations,withheld.copyright,withheld.country_codes,withheld.scope,date,inflow
0,"[{'type': 'replied_to', 'id': '722923589457068...",722923969897037824,de,2016-04-20 23:04:40+00:00,722921383605506048,['722923969897037824'],14526045,41482148,"@FrauWeh Film gesehen und nur gestaunt. Wir, a...",False,...,2016.0,2016-16,2016-04,German,,,,,NaT,Syrians
1,,722921572810366977,de,2016-04-20 22:55:08+00:00,722921572810366977,['722921572810366977'],4122038069,,"Syrisch-orthodoxer Bischof: ""Was im Nahen Oste...",False,...,2016.0,2016-16,2016-04,German,,,,,NaT,Syrians
2,,722899547039473665,de,2016-04-20 21:27:37+00:00,722899547039473665,['722899547039473665'],1179543852,,"Habe schon lang nicht gehört, daß Flüchtling G...",False,...,2016.0,2016-16,2016-04,German,,,,,NaT,Syrians
3,,722897370313195521,de,2016-04-20 21:18:58+00:00,722897370313195521,['722897370313195521'],224607633,,"""Es kommen kaum noch Flüchtlinge nach Griechen...",False,...,2016.0,2016-16,2016-04,German,,,,,NaT,Syrians
4,"[{'type': 'quoted', 'id': '722860149807788032'}]",722891791771373573,de,2016-04-20 20:56:48+00:00,722891791771373573,['722891791771373573'],3022904603,,"Verständlich, aber #Frankreich muss eigene Feh...",False,...,2016.0,2016-16,2016-04,German,,,,,NaT,Syrians
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
68316,,1408372919995146240,en,2021-06-25 10:34:05+00:00,1408372919995146240,['1408372919995146240'],232958476,,"The ministry of immigration , runs the biggest...",False,...,2021.0,2021-25,2021-06,English,,,,,2021-06-25 00:00:00+00:00,Ukrainians
68317,"[{'type': 'replied_to', 'id': '140814441498846...",1408145312926126088,en,2021-06-24 19:29:39+00:00,1407248240156815360,['1408145312926126088'],9474872,9474872.0,"@sudo_f @typo3 @felicity_brand Intellectually,...",False,...,2021.0,2021-25,2021-06,English,"[{'start': 171, 'end': 185, 'probability': 0.7...",,,,2021-06-24 00:00:00+00:00,Ukrainians
68318,"[{'type': 'replied_to', 'id': '140813094101754...",1408131214633058311,en,2021-06-24 18:33:38+00:00,1407750900530171907,['1408131214633058311'],980714168,2199678761.0,@Waringphilip Agree. Immigration has done me p...,False,...,2021.0,2021-25,2021-06,English,,,,,2021-06-24 00:00:00+00:00,Ukrainians
68319,"[{'type': 'replied_to', 'id': '140775446414501...",1408021231664762881,en,2021-06-24 11:16:36+00:00,1407754464145018882,['1408021231664762881'],185889479,10809412.0,@rakyll I would love to have automatic cross z...,False,...,2021.0,2021-25,2021-06,English,"[{'start': 50, 'end': 52, 'probability': 0.628...",,,,2021-06-24 00:00:00+00:00,Ukrainians


## 7. Finding and dropping unnecessary columns
Several columns are unnecessary for the analysis and will be dropped.
Steps:
* View all column names
* Inspect some of the columns to see if they are needed or not
* Drop the columns identified as unnecessary

### 7.1. Viewing all column names

In [65]:
# viewing all column names

df.columns

Index(['referenced_tweets', 'id', 'lang', 'created_at', 'conversation_id',
       'edit_history_tweet_ids', 'author_id', 'in_reply_to_user_id', 'text',
       'possibly_sensitive', 'reply_settings', 'edit_controls.edits_remaining',
       'edit_controls.is_edit_eligible', 'edit_controls.editable_until',
       'geo.place_id', 'entities.hashtags', 'entities.mentions',
       'public_metrics.retweet_count', 'public_metrics.reply_count',
       'public_metrics.like_count', 'public_metrics.quote_count',
       'public_metrics.impression_count', 'entities.urls',
       'geo.coordinates.type', 'geo.coordinates.coordinates',
       'attachments.media_keys', 'context_annotations', 'attachments.poll_ids',
       'entities.cashtags', 'new_created_at', 'week', 'month', 'year',
       'year-week', 'year-month', 'Language', 'entities.annotations',
       'withheld.copyright', 'withheld.country_codes', 'withheld.scope',
       'date', 'inflow'],
      dtype='object')

### 7.2. Inspecting column contents

In [66]:
# inspecting column 'reply_settings' (only looking at first 5 entries)

print(df['reply_settings'][0:5])

0    everyone
1    everyone
2    everyone
3    everyone
4    everyone
Name: reply_settings, dtype: object


In [67]:
# inspecting column 'edit_controls.edits_remaining' (only looking at first 5 entries)

print(df['edit_controls.edits_remaining'][0:5])

0    5.0
1    5.0
2    5.0
3    5.0
4    5.0
Name: edit_controls.edits_remaining, dtype: float64


In [69]:
# inspecting column 'entities.hashtags' (only looking at first 20 entries)

print(df['entities.hashtags'][0:20])

0           [{'start': 126, 'end': 131, 'tag': 'OMFG'}]
1                                                   NaN
2                                                   NaN
3                                                   NaN
4     [{'start': 19, 'end': 30, 'tag': 'Frankreich'}...
5      [{'start': 23, 'end': 35, 'tag': 'Flüchtlinge'}]
6     [{'start': 9, 'end': 21, 'tag': 'Flüchtlinge'}...
7                                                   NaN
8                                                   NaN
9     [{'start': 0, 'end': 7, 'tag': 'Boehmi'}, {'st...
10    [{'start': 28, 'end': 38, 'tag': 'migration'},...
11    [{'start': 0, 'end': 8, 'tag': 'Bamberg'}, {'s...
12    [{'start': 64, 'end': 75, 'tag': 'Mittelmeer'}...
13                                                  NaN
14                                                  NaN
15                                                  NaN
16                                                  NaN
17                                              

In [70]:
# inspecting column 'entities.mentions' (only looking at first 20 entries)

print(df['entities.mentions'][0:20])

0     [{'start': 0, 'end': 8, 'username': 'FrauWeh',...
1                                                   NaN
2                                                   NaN
3                                                   NaN
4     [{'start': 78, 'end': 92, 'username': 'spiegel...
5                                                   NaN
6                                                   NaN
7                                                   NaN
8     [{'start': 0, 'end': 8, 'username': 'hataibu',...
9                                                   NaN
10    [{'start': 123, 'end': 138, 'username': 'Krist...
11                                                  NaN
12                                                  NaN
13                                                  NaN
14                                                  NaN
15    [{'start': 96, 'end': 101, 'username': 'welt',...
16    [{'start': 96, 'end': 101, 'username': 'welt',...
17                                              

In [71]:
# inspecting column 'entities.urls' (only looking at first 20 entries)

print(df['entities.urls'][0:20])

0                                                   NaN
1     [{'start': 92, 'end': 115, 'url': 'https://t.c...
2                                                   NaN
3                                                   NaN
4     [{'start': 117, 'end': 140, 'url': 'https://t....
5     [{'start': 83, 'end': 106, 'url': 'https://t.c...
6     [{'start': 92, 'end': 115, 'url': 'https://t.c...
7     [{'start': 115, 'end': 138, 'url': 'https://t....
8                                                   NaN
9                                                   NaN
10    [{'start': 93, 'end': 116, 'url': 'https://t.c...
11    [{'start': 70, 'end': 93, 'url': 'https://t.co...
12    [{'start': 113, 'end': 136, 'url': 'https://t....
13                                                  NaN
14    [{'start': 76, 'end': 99, 'url': 'https://t.co...
15    [{'start': 68, 'end': 91, 'url': 'https://t.co...
16    [{'start': 68, 'end': 91, 'url': 'https://t.co...
17    [{'start': 110, 'end': 133, 'url': 'https:

In [72]:
# inspecting column 'geo.coordinates.type' (only looking at first 5 entries)

print(df['geo.coordinates.type'][0:5])

0      NaN
1      NaN
2    Point
3      NaN
4      NaN
Name: geo.coordinates.type, dtype: object


In [73]:
# inspecting column 'attachments.media_keys' (only looking at first 5 entries)

print(df['attachments.media_keys'][0:5])

0    NaN
1    NaN
2    NaN
3    NaN
4    NaN
Name: attachments.media_keys, dtype: object


In [74]:
# inspecting column 'context_annotations' (only looking at first 5 entries)

print(df['context_annotations'][0:5])

0    NaN
1    NaN
2    NaN
3    NaN
4    NaN
Name: context_annotations, dtype: object


In [75]:
# inspecting column 'attachments.poll_ids' (only looking at first 5 entries)

print(df['attachments.poll_ids'][0:5])

0    NaN
1    NaN
2    NaN
3    NaN
4    NaN
Name: attachments.poll_ids, dtype: object


In [76]:
# inspecting column 'entities.cashtags' (only looking at first 5 entries)

print(df['entities.cashtags'][0:5])

0    NaN
1    NaN
2    NaN
3    NaN
4    NaN
Name: entities.cashtags, dtype: object


In [77]:
# inspecting column 'entities.annotations' (only looking at first 5 entries)

print(df['entities.annotations'][0:5])

0    NaN
1    NaN
2    NaN
3    NaN
4    NaN
Name: entities.annotations, dtype: object
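
Printing the first few entries only shows a small slice of each column. As a complementary sketch (not part of the original workflow), the share of missing values and the number of distinct values per column can be summarised in one table. The toy dataframe and the name `summary` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame: one entirely empty column, one half-empty, one constant
toy = pd.DataFrame({
    "text": ["a", "b", "c", "d"],
    "entities.cashtags": [np.nan, np.nan, np.nan, np.nan],
    "entities.hashtags": ["[{'tag': 'x'}]", np.nan, np.nan, "[{'tag': 'y'}]"],
    "reply_settings": ["everyone"] * 4,
})

summary = pd.DataFrame({
    # fraction of rows that are NaN in each column
    "share_missing": toy.isna().mean(),
    # number of distinct non-missing values in each column
    "n_unique": toy.nunique(dropna=True),
})

print(summary.sort_values("share_missing", ascending=False))
```

Columns that are almost entirely missing, or that hold a single constant value, are natural candidates for dropping, although the final decision still depends on the analysis.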


### 7.3. Dropping columns

Unnecessary columns include:
* referenced_tweets
* id
* conversation_id
* edit_history_tweet_ids
* possibly_sensitive
* reply_settings
* edit_controls.edits_remaining
* edit_controls.is_edit_eligible
* edit_controls.editable_until
* entities.mentions
* entities.urls
* geo.coordinates.type
* attachments.media_keys
* context_annotations
* attachments.poll_ids
* entities.cashtags
* entities.annotations
* withheld.copyright
* withheld.country_codes
* withheld.scope

**NOTE**: Some of the remaining columns may turn out to be unnecessary, but they are left in the dataframe just in case.

In [78]:
# dropping the columns

df.drop(
    [
        'referenced_tweets', 'id', 'conversation_id', 'edit_history_tweet_ids',
        'possibly_sensitive', 'reply_settings', 'edit_controls.edits_remaining',
        'edit_controls.is_edit_eligible', 'edit_controls.editable_until',
        'entities.mentions', 'entities.urls', 'geo.coordinates.type',
        'attachments.media_keys', 'context_annotations', 'attachments.poll_ids',
        'entities.cashtags', 'entities.annotations', 'withheld.copyright',
        'withheld.country_codes', 'withheld.scope',
    ],
    axis=1,
    inplace=True,
)
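
`drop` raises a `KeyError` if any listed column is absent, for example when the cell is run a second time. One possible safeguard, sketched on a toy dataframe with hypothetical names (`toy`, `to_drop`, `trimmed`), is to check the list against the existing columns first and to return a new dataframe rather than modifying in place.

```python
import pandas as pd

# Toy frame with only a few of the columns
toy = pd.DataFrame({"id": [1, 2],
                    "text": ["a", "b"],
                    "reply_settings": ["everyone", "everyone"]})

# The last name is deliberately absent from toy
to_drop = ["id", "reply_settings", "withheld.scope"]

# Split the list into names that exist and names that do not
missing = [c for c in to_drop if c not in toy.columns]
present = [c for c in to_drop if c in toy.columns]

# Drop only what exists; toy itself is left unchanged
trimmed = toy.drop(columns=present)

print("not found:", missing)
print(list(trimmed.columns))
```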


In [80]:
# viewing all column names

df.columns

Index(['lang', 'created_at', 'author_id', 'in_reply_to_user_id', 'text',
       'geo.place_id', 'entities.hashtags', 'public_metrics.retweet_count',
       'public_metrics.reply_count', 'public_metrics.like_count',
       'public_metrics.quote_count', 'public_metrics.impression_count',
       'geo.coordinates.coordinates', 'new_created_at', 'week', 'month',
       'year', 'year-week', 'year-month', 'Language', 'date', 'inflow'],
      dtype='object')

In [79]:
# viewing the dataframe

df

Unnamed: 0,lang,created_at,author_id,in_reply_to_user_id,text,geo.place_id,entities.hashtags,public_metrics.retweet_count,public_metrics.reply_count,public_metrics.like_count,...,geo.coordinates.coordinates,new_created_at,week,month,year,year-week,year-month,Language,date,inflow
0,de,2016-04-20 23:04:40+00:00,14526045,41482148,"@FrauWeh Film gesehen und nur gestaunt. Wir, a...",e11a8b8e3771f9fa,"[{'start': 126, 'end': 131, 'tag': 'OMFG'}]",0,1.0,0.0,...,,2016-04-20 23:04:40,16.0,4.0,2016.0,2016-16,2016-04,German,NaT,Syrians
1,de,2016-04-20 22:55:08+00:00,4122038069,,"Syrisch-orthodoxer Bischof: ""Was im Nahen Oste...",06d9a7c249c59bcd,,0,0.0,0.0,...,,2016-04-20 22:55:08,16.0,4.0,2016.0,2016-16,2016-04,German,NaT,Syrians
2,de,2016-04-20 21:27:37+00:00,1179543852,,"Habe schon lang nicht gehört, daß Flüchtling G...",e99b714fe65be4fb,,0,0.0,0.0,...,"[7.1468836, 50.7306348]",2016-04-20 21:27:37,16.0,4.0,2016.0,2016-16,2016-04,German,NaT,Syrians
3,de,2016-04-20 21:18:58+00:00,224607633,,"""Es kommen kaum noch Flüchtlinge nach Griechen...",3078869807f9dd36,,0,0.0,0.0,...,,2016-04-20 21:18:58,16.0,4.0,2016.0,2016-16,2016-04,German,NaT,Syrians
4,de,2016-04-20 20:56:48+00:00,3022904603,,"Verständlich, aber #Frankreich muss eigene Feh...",48504653e183c91c,"[{'start': 19, 'end': 30, 'tag': 'Frankreich'}...",0,0.0,0.0,...,,2016-04-20 20:56:48,16.0,4.0,2016.0,2016-16,2016-04,German,NaT,Syrians
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
68316,en,2021-06-25 10:34:05+00:00,232958476,,"The ministry of immigration , runs the biggest...",37439688c6302728,,0.0,0.0,0.0,...,,NaT,25.0,6.0,2021.0,2021-25,2021-06,English,2021-06-25 00:00:00+00:00,Ukrainians
68317,en,2021-06-24 19:29:39+00:00,9474872,9474872.0,"@sudo_f @typo3 @felicity_brand Intellectually,...",8abc99434d4f5d28,,0.0,1.0,3.0,...,,NaT,25.0,6.0,2021.0,2021-25,2021-06,English,2021-06-24 00:00:00+00:00,Ukrainians
68318,en,2021-06-24 18:33:38+00:00,980714168,2199678761.0,@Waringphilip Agree. Immigration has done me p...,c82d9e53ae03d753,,0.0,0.0,1.0,...,,NaT,25.0,6.0,2021.0,2021-25,2021-06,English,2021-06-24 00:00:00+00:00,Ukrainians
68319,en,2021-06-24 11:16:36+00:00,185889479,10809412.0,@rakyll I would love to have automatic cross z...,5bcd72da50f0ee77,,0.0,0.0,0.0,...,,NaT,25.0,6.0,2021.0,2021-25,2021-06,English,2021-06-24 00:00:00+00:00,Ukrainians


## 8. Saving the data

In [82]:
df.to_csv(CASS_thesis / "02_Pre-processed_merged.csv")
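
By default, `to_csv` also writes the row index as an unnamed first column, which appears as `Unnamed: 0` when the file is read back with `pd.read_csv`. The sketch below shows the difference using a temporary folder; `out_dir` stands in for CASS_thesis and the file names are hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

toy = pd.DataFrame({"text": ["a", "b"],
                    "inflow": ["Syrians", "Ukrainians"]})

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp)  # stand-in for the CASS_thesis folder

    # Default: the index is written as an extra, unnamed column
    toy.to_csv(out_dir / "with_index.csv")
    # index=False: only the data columns are written
    toy.to_csv(out_dir / "without_index.csv", index=False)

    cols_with_index = list(pd.read_csv(out_dir / "with_index.csv").columns)
    cols_without_index = list(pd.read_csv(out_dir / "without_index.csv").columns)

print(cols_with_index)
print(cols_without_index)
```

Whether to pass `index=False` depends on how the next notebook reads the file; if the index column is kept, it can be dropped again after loading.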