## No SDG archive merger
This notebook merges the data files containing tweets sampled in six 5-minute windows, each starting at a randomly chosen datetime since 2016. Tweets were collected only if they matched no SDG-related tag and were in English. The windows start at the following datetimes (YYYY-MM-DD hh:mm):\
2016-04-06 00:00\
2017-07-23 14:00\
2018-09-30 21:00\
2019-11-02 07:00\
2020-02-10 18:00\
2021-08-03 11:00
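
The windows listed above can be turned into a quick sanity check on `created_at`. This is a minimal sketch, not part of the original pipeline: it assumes the listed datetimes are UTC (suggested by the `Z` suffix in `created_at`) and that each window is half-open, `[start, start + 5 min)`. The names `WINDOW_STARTS`, `WINDOW_LENGTH`, `in_sampling_window` and the `sample` values are hypothetical.

```python
import pandas as pd

# Window starts from the list above; UTC is an assumption
WINDOW_STARTS = pd.to_datetime(
    [
        "2016-04-06 00:00",
        "2017-07-23 14:00",
        "2018-09-30 21:00",
        "2019-11-02 07:00",
        "2020-02-10 18:00",
        "2021-08-03 11:00",
    ],
    utc=True,
)
WINDOW_LENGTH = pd.Timedelta(minutes=5)


def in_sampling_window(created_at: pd.Series) -> pd.Series:
    """Return True for each timestamp that falls inside any 5-minute window."""
    ts = pd.to_datetime(created_at, utc=True)
    mask = pd.Series(False, index=ts.index)
    for start in WINDOW_STARTS:
        mask |= (ts >= start) & (ts < start + WINDOW_LENGTH)
    return mask


# Hypothetical timestamps for illustration only
sample = pd.Series(
    [
        "2016-04-06T00:04:59.000Z",  # inside the first window
        "2016-04-06T00:05:00.000Z",  # exactly at the end, so outside
        "2021-08-03T11:00:00.000Z",  # start of the last window
    ]
)
print(in_sampling_window(sample).tolist())
```

On the merged data, `in_sampling_window(df_enAll['created_at']).all()` would be one way to confirm that no tweet falls outside the sampled windows.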

In [2]:
# IMPORTS
import pandas as pd
import csv

In [3]:
# Read all six files and concatenate them into a single dataframe
df_enAll = pd.concat([pd.read_csv('./Raw data/NoRelevant_en'+str(x+1)+'.csv') for x in range(6)])

In [5]:
df_enAll = df_enAll[['id','created_at','text']] # Keep only the columns that carry information
df_enAll.info()
display(df_enAll.head())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 845071 entries, 0 to 147485
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   id          845071 non-null  int64 
 1   created_at  845071 non-null  object
 2   text        845071 non-null  object
dtypes: int64(1), object(2)
memory usage: 25.8+ MB


Unnamed: 0,id,created_at,text
0,717503333775319040,2016-04-06T00:04:59.000Z,@mydickisbae finally 👏👏😂
1,717503333758468097,2016-04-06T00:04:59.000Z,@TheTyee Thanks!
2,717503333758447616,2016-04-06T00:04:59.000Z,Smokin this blunt so ion do no stupid shit
3,717503333758423041,2016-04-06T00:04:59.000Z,@__clovely hml we need to burn it soon 💕
4,717503333754273792,2016-04-06T00:04:59.000Z,Dasar attention seeker. 👊👊👊


In [7]:
# Check for duplicates (every tweet id should be unique)
print(df_enAll['id'].nunique()) # unique tweet ids
print(len(df_enAll)) #total number of rows

843607
845071


In [9]:
# Remove duplicates using tweet id
df_enAll.drop_duplicates(subset=['id'], inplace=True)
print(df_enAll['id'].nunique())
print(len(df_enAll))

843607
843607
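
The two counts before deduplication (845071 rows against 843607 unique ids) differ by 1464, which is the number of rows that repeated an id already present. A minimal sketch of how such rows could be inspected before dropping them, using a hypothetical toy frame rather than the real data (`toy` and its values are made up for illustration):

```python
import pandas as pd

# Hypothetical toy data: id 2 appears twice
toy = pd.DataFrame(
    {
        "id": [1, 2, 2, 3],
        "created_at": ["2016-04-06T00:00:00.000Z"] * 4,
        "text": ["a", "b", "b", "c"],
    }
)

# keep=False flags every row whose id occurs more than once
repeated = toy[toy.duplicated(subset=["id"], keep=False)]

# Surplus rows = total rows minus unique ids
n_surplus = len(toy) - toy["id"].nunique()

# drop_duplicates keeps the first occurrence of each id by default
deduped = toy.drop_duplicates(subset=["id"])

print(len(repeated), n_surplus, len(deduped))
```

Looking at `repeated` first shows whether rows sharing an id are exact copies or differ in `text`, which matters because only the first occurrence is kept.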


In [10]:
# Sort by date and reset the index
df_enAll = df_enAll.sort_values(by=['created_at'])
df_enAll.reset_index(drop=True, inplace=True)
display(df_enAll.head())
display(df_enAll.tail())

Unnamed: 0,id,created_at,text
0,717502078546014208,2016-04-06T00:00:00.000Z,"Ayy lmao HE CAN HANDSTAND, WHEN HE HAS NO GRAC..."
1,717502078327853056,2016-04-06T00:00:00.000Z,.@RchrdAlln suggests Kingston Mills Locks. Wev...
2,717502078331985920,2016-04-06T00:00:00.000Z,@likeold2's account is temporarily unavailable...
3,717502078336180224,2016-04-06T00:00:00.000Z,"Wind 3.0 mph SSE. Barometer 30.237 in, Falling..."
4,717502078340431872,2016-04-06T00:00:00.000Z,I've completed the daily quest in Paradise Isl...


Unnamed: 0,id,created_at,text
843602,1422513823999614979,2021-08-03T11:04:59.000Z,@Sentry023 Haha I know and I like Rubio a lot....
843603,1422513823978639362,2021-08-03T11:04:59.000Z,@troytrade To the moon
843604,1422513823966277632,2021-08-03T11:04:59.000Z,@PinsTrading REST
843605,1422513823966171136,2021-08-03T11:04:59.000Z,I recommend that all scientists and academics ...
843606,1422513823731359754,2021-08-03T11:04:59.000Z,"new morning routine is wake up, queue for spli..."
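
Sorting works here even though `created_at` is stored as strings, because ISO 8601 timestamps in a uniform format order the same way lexicographically and chronologically. A small sketch that compares both orderings on hypothetical rows (the `toy` frame is made up; `kind="stable"` is used only to make the comparison deterministic):

```python
import pandas as pd

# Hypothetical rows, deliberately out of order
toy = pd.DataFrame(
    {
        "id": [3, 1, 2],
        "created_at": [
            "2021-08-03T11:04:59.000Z",
            "2016-04-06T00:00:00.000Z",
            "2019-11-02T07:01:30.000Z",
        ],
    }
)

# Order obtained by sorting the raw strings
by_string = toy.sort_values(by="created_at", kind="stable")["id"].tolist()

# Order obtained by sorting parsed, timezone-aware timestamps
parsed = toy.assign(ts=pd.to_datetime(toy["created_at"], utc=True))
by_datetime = parsed.sort_values(by="ts", kind="stable")["id"].tolist()

print(by_string == by_datetime)
```

If the timestamp format were not uniform (for example, mixed UTC offsets), parsing with `pd.to_datetime` before sorting would be the safer choice.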


In [11]:
# Save the merged dataset
df_enAll.to_csv('NoRelevant_enAll.csv', index=False)
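
Writing with `index=False` keeps the row index out of the file, so reading it back yields only the three data columns. A minimal round-trip sketch, using an in-memory buffer instead of a file on disk and a single row copied from the output shown earlier (the `toy`, `buffer` and `restored` names are illustrative):

```python
import io

import pandas as pd

# One row taken from the head() output above
toy = pd.DataFrame(
    {
        "id": [717503333758468097],
        "created_at": ["2016-04-06T00:04:59.000Z"],
        "text": ["@TheTyee Thanks!"],
    }
)

# Write to an in-memory buffer without the index, then read it back
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored.columns.tolist())
print(restored["id"].dtype)
```

The tweet ids fit in a 64-bit integer, so they survive the round trip unchanged as long as they are read back as integers rather than floats.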