# First Step: Importing Tweets stored as JSON file.

In [1]:
import pandas as pd 
import glob
import os

## Reading all the `JSON` files from the folder.
- providing the folder path.
- creating an empty list named `filelist`.
- `glob.glob` returns a list of path strings for all matching files in the folder, and this list populates `filelist`.
- `glob.iglob` could also be used together with `os.path.join`. For example, `filelist = glob.iglob(os.path.join(raw_data_path, "*.json"))`; note that `iglob` returns an iterator rather than a list.
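
The points above can be sketched as a small helper. This is only an illustration: the function name `list_json_files` and the file names in the temporary folder are hypothetical, not part of the original workflow.

```python
import glob
import os
import tempfile


def list_json_files(folder):
    """Return the sorted paths of all *.json files directly inside folder."""
    pattern = os.path.join(folder, "*.json")
    # glob.iglob yields paths lazily; sorted() turns them into an ordered list.
    return sorted(glob.iglob(pattern))


# Demonstration on a throwaway folder with hypothetical file names.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("a.json", "b.json", "notes.txt"):
        open(os.path.join(demo_dir, name), "w").close()
    found = [os.path.basename(p) for p in list_json_files(demo_dir)]
    print(found)
```

Building the pattern with `os.path.join` avoids gluing the separator by hand, and sorting makes the processing order repeatable.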

In [4]:
raw_data_path = input("Enter the path for raw data file: ")
filelist = []
filelist = glob.glob(raw_data_path + "/" + "*.json")
print(filelist)

Enter the path for raw data file:  /media/ambijat/Datrat/ipytoneeee/CNUS


['/media/ambijat/Datrat/ipytoneeee/CNUS/CHIIND03Jun2020_2224.json']


##### The variable `filelist` above is a list of `json` file paths stored as strings. Each item in the list can be iterated over in the next stage of processing.

### Defining path to export the processed tweets into `CSV` file.
- `os.path.join` combines the folder path with the new file named `tweets.csv`.
- the new file `tweets.csv` will contain all the dressed-up tweets, ready for further analysis.
- this is the basic file for all future operations.

### String address for the `tweets.csv` is created below.

In [5]:
print(raw_data_path)
store_file = os.path.join(raw_data_path, 'tweets.csv')
print(store_file)

/media/ambijat/Datrat/ipytoneeee/CNUS
/media/ambijat/Datrat/ipytoneeee/CNUS/tweets.csv


## 2. Processing of `JSON` file with the help of `BASH` shell.
- `Bash` is a terrific tool for handling files that run to gigabytes in size.
- it is important to reference variables correctly when passing them between python and the `bash` shell.
- the `$` sign is used to reference a python variable inside a `bash` command.

### Using the JSON query tool `jq` for stripping the tweet, which has many header keys.
- There is a strong reason for introducing `bash` here.
- I can work on files of even 8-10 gigabytes, which is very difficult in python.
- also, the export of the tweet fields is smooth and quick.

### `$store_file` is the path for exporting all the tweets.

##### Remember that `$store_file` receives all the data. I use `!rm` to remove any previously present file, because `>>` appends the tweets to whatever is already there.

In [6]:
!rm $store_file

for file in filelist:
    ! echo $file
    ! jq -r '[.user.screen_name, .retweeted_status.user.screen_name, .full_text, .display_text_range[1], .created_at, .id, .in_reply_to_user_id, .user.location] | @csv' < $file >> $store_file

rm: cannot remove '/media/ambijat/Datrat/ipytoneeee/CNUS/tweets.csv': No such file or directory
/media/ambijat/Datrat/ipytoneeee/CNUS/CHIIND03Jun2020_2224.json
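
For readers without `jq`, the field selection done by the filter can be sketched in plain Python. This is a minimal sketch, assuming the raw file holds one tweet JSON object per line; the helper names and the one-tweet sample are hypothetical. Unlike `@csv`, which quotes every string, `csv.writer` quotes only where needed.

```python
import csv
import io
import json


def tweet_to_row(tweet):
    """Pick the same eight fields as the jq filter; missing values become None."""
    user = tweet.get("user") or {}
    retweeted_user = (tweet.get("retweeted_status") or {}).get("user") or {}
    text_range = tweet.get("display_text_range") or [None, None]
    return [
        user.get("screen_name"),
        retweeted_user.get("screen_name"),
        tweet.get("full_text"),
        text_range[1],
        tweet.get("created_at"),
        tweet.get("id"),
        tweet.get("in_reply_to_user_id"),
        user.get("location"),
    ]


def tweets_to_csv(lines, out):
    """Write one CSV row per non-empty JSON line; None is written as an empty field."""
    writer = csv.writer(out)
    for line in lines:
        if line.strip():
            writer.writerow(tweet_to_row(json.loads(line)))


# Hypothetical one-tweet sample shaped like the fields used above.
sample = ['{"user": {"screen_name": "alice", "location": "HK"}, '
          '"full_text": "hello", "display_text_range": [0, 5], '
          '"created_at": "Wed Jun 03 16:54:54 +0000 2020", '
          '"id": 1268224655061479424, "in_reply_to_user_id": null}']
buffer = io.StringIO()
tweets_to_csv(sample, buffer)
print(buffer.getvalue())
```

Python's `json` module keeps large integers such as `id` exact, which is convenient when the `id` is later used to find duplicates.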


### Importing the Dressed Up Tweet data.
- Defining the column names for the imported tweets.
- Reading the data with `read_csv`, which itself returns the `DataFrame`, using the names given by `column_names`.
#### retrieving and visualising the tweets.

In [7]:
column_names = ["user", "reuser", "full_text", "range", "created_at", "id", "in_reply_to_user_id", "user.location"]
# read_csv creates the DataFrame, so no prior declaration of `tweets` is needed.
tweets = pd.read_csv(store_file, header = None, names = column_names, low_memory = False)

##### A successful import of tweets can be inspected by using the `head()` function.

In [8]:
tweets.head()

Unnamed: 0,user,reuser,full_text,range,created_at,id,in_reply_to_user_id,user.location
0,TomWong93767868,SkyNews,RT @SkyNews: The PM has pledged to allow nearl...,139,Wed Jun 03 16:54:54 +0000 2020,1268224655061479400,,
1,98h33654581,HongKongFP,RT @HongKongFP: [Recap] Pompeo says China is t...,135,Wed Jun 03 16:54:54 +0000 2020,1268224652993630200,,
2,tckj725,Reuters,RT @Reuters: HSBC says it supports China's sec...,117,Wed Jun 03 16:54:53 +0000 2020,1268224650112200700,,Australia、Hong kong
3,kizza_marxel,nytimes,RT @nytimes: Breaking News: Prime Minister Bor...,140,Wed Jun 03 16:54:51 +0000 2020,1268224642839240700,,
4,PilotAbilene,Steinhoefel,"RT @Steinhoefel: Einwanderung, aber richtig. I...",140,Wed Jun 03 16:54:51 +0000 2020,1268224642390593500,,Shermany


### Removal of duplicate tweets.
- counting the tweets and comparing that with the number of `unique` values of `id` gives the number of duplicate tweets.
- each tweet has a unique `id`, so an `id` that occurs more than once marks a duplicate tweet.

In [9]:
print(len(tweets))
print(len(tweets.id.unique()))
print(len(tweets) - len(tweets.id.unique()))

116018
116017
1


### New dataframe `tweets2` created after dropping duplicates.
### Duplicate tweets are dropped by tweet `id`, as each tweet has its unique token under column `id`.
- the number of duplicate tweets should be 0 before further processing.
- using the `drop_duplicates()` function.

In [10]:
tweets2 = tweets.drop_duplicates(subset='id', keep="first").copy()

#### Re-examining the removal of the duplicate tweets. The count equals the number of unique ids found in the previous result.

In [11]:
len(tweets2)

116017
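
The counting and dropping steps can be tried on a small frame first. The data below is a hypothetical toy example, not taken from the tweet file.

```python
import pandas as pd

# Hypothetical toy data: the id 2 appears twice.
toy = pd.DataFrame({"id": [1, 2, 2, 3], "full_text": ["a", "b", "b again", "c"]})

# duplicated() flags every row whose id was already seen earlier.
n_duplicates = int(toy["id"].duplicated().sum())

# keep="first" retains the first row for each id and drops the later ones.
deduped = toy.drop_duplicates(subset="id", keep="first").copy()

print(n_duplicates)
print(len(deduped))
print(deduped["id"].is_unique)
```

`duplicated().sum()` gives the same number as subtracting the unique count from the total length, in a single expression.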

## 3. Extraction of Tweets Datewise.
#### The function `to_datetime` is used to parse the column `created_at` into datetime values.
##### This is used for further numeric operations and filtering of the tweets based on dates.
#### segregating tweets2 based on dates, step 1.

In [12]:
tweets2['date'] = pd.to_datetime(tweets2['created_at'], errors='coerce')

In [13]:
tweets2.dtypes

user                                object
reuser                              object
full_text                           object
range                                int64
created_at                          object
id                                   int64
in_reply_to_user_id                float64
user.location                       object
date                   datetime64[ns, UTC]
dtype: object

##### The use of `errors='coerce'` gives `NaT` in the `date` column wherever a value does not conform to a date format.
##### The `date` column has produced dates in `YYYY-MM-DD HH:MM:SS` format, with the UTC offset.
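
The effect of `errors='coerce'` can be seen on two values, one valid and one deliberately malformed. This is a sketch under an assumption: the explicit `format` string and `utc=True` are additions for predictability and are not used in the cell above.

```python
import pandas as pd

# One valid Twitter-style timestamp and one hypothetical malformed value.
created = pd.Series(["Wed Jun 03 16:54:54 +0000 2020", "not a date"])

parsed = pd.to_datetime(
    created,
    format="%a %b %d %H:%M:%S %z %Y",  # assumed layout of `created_at`
    errors="coerce",                   # unparseable values become NaT
    utc=True,
)

print(parsed)
print(parsed.isna().sum())
```

Counting the `NaT` values after parsing is a quick way to learn how many rows failed to convert.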

In [14]:
tweets2.head()

Unnamed: 0,user,reuser,full_text,range,created_at,id,in_reply_to_user_id,user.location,date
0,TomWong93767868,SkyNews,RT @SkyNews: The PM has pledged to allow nearl...,139,Wed Jun 03 16:54:54 +0000 2020,1268224655061479400,,,2020-06-03 16:54:54+00:00
1,98h33654581,HongKongFP,RT @HongKongFP: [Recap] Pompeo says China is t...,135,Wed Jun 03 16:54:54 +0000 2020,1268224652993630200,,,2020-06-03 16:54:54+00:00
2,tckj725,Reuters,RT @Reuters: HSBC says it supports China's sec...,117,Wed Jun 03 16:54:53 +0000 2020,1268224650112200700,,Australia、Hong kong,2020-06-03 16:54:53+00:00
3,kizza_marxel,nytimes,RT @nytimes: Breaking News: Prime Minister Bor...,140,Wed Jun 03 16:54:51 +0000 2020,1268224642839240700,,,2020-06-03 16:54:51+00:00
4,PilotAbilene,Steinhoefel,"RT @Steinhoefel: Einwanderung, aber richtig. I...",140,Wed Jun 03 16:54:51 +0000 2020,1268224642390593500,,Shermany,2020-06-03 16:54:51+00:00


##### We shall keep only the date and drop the hours, minutes and seconds components.
##### The function `date.dt.strftime('%d%m%Y')` is used to format the date as a day-month-year string.

#### segregating tweets2 based on dates, step 2.
#### a new column `date2` is created by keeping only day, month and year from column `date`.

In [15]:
tweets2['date2'] = tweets2.date.dt.strftime('%d%m%Y')

In [16]:
tweets2.head()

Unnamed: 0,user,reuser,full_text,range,created_at,id,in_reply_to_user_id,user.location,date,date2
0,TomWong93767868,SkyNews,RT @SkyNews: The PM has pledged to allow nearl...,139,Wed Jun 03 16:54:54 +0000 2020,1268224655061479400,,,2020-06-03 16:54:54+00:00,3062020
1,98h33654581,HongKongFP,RT @HongKongFP: [Recap] Pompeo says China is t...,135,Wed Jun 03 16:54:54 +0000 2020,1268224652993630200,,,2020-06-03 16:54:54+00:00,3062020
2,tckj725,Reuters,RT @Reuters: HSBC says it supports China's sec...,117,Wed Jun 03 16:54:53 +0000 2020,1268224650112200700,,Australia、Hong kong,2020-06-03 16:54:53+00:00,3062020
3,kizza_marxel,nytimes,RT @nytimes: Breaking News: Prime Minister Bor...,140,Wed Jun 03 16:54:51 +0000 2020,1268224642839240700,,,2020-06-03 16:54:51+00:00,3062020
4,PilotAbilene,Steinhoefel,"RT @Steinhoefel: Einwanderung, aber richtig. I...",140,Wed Jun 03 16:54:51 +0000 2020,1268224642390593500,,Shermany,2020-06-03 16:54:51+00:00,3062020


#### An array of all the dates is collected with the `unique()` function.

In [17]:
collect = tweets2.date2.unique()
print(collect)
type(collect)

['03062020' '02062020' '01062020']


numpy.ndarray

#### making folder for storing tweets datewise.
'''
Here, apart from the string variable `datewise_path` storing the value of the path, the actual folder is also created in the following lines. The function `os.path.exists()` checks whether a folder named `datewise` exists; if there is no such folder, the function `os.makedirs()` creates one.
'''
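
The check-then-create pattern described above can also be written in one call with `exist_ok=True`. A minimal sketch, using a temporary folder as a stand-in for `raw_data_path`:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as base_dir:  # stand-in for raw_data_path
    demo_path = os.path.join(base_dir, "datewise")
    os.makedirs(demo_path, exist_ok=True)  # creates the folder
    os.makedirs(demo_path, exist_ok=True)  # second call does nothing and raises no error
    created = os.path.isdir(demo_path)
    print(created)
```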

In [18]:
datewise_path = raw_data_path + "/" + 'datewise'
print(datewise_path)
if not os.path.exists(datewise_path):
    os.makedirs(datewise_path)

/media/ambijat/Datrat/ipytoneeee/CNUS/datewise


### Exporting the tweets to datewise `csv` files.
- declaring `variables = locals()`, so that each datewise dataframe can be stored under a generated name.
- `y` just collects the names of the `csv` files.
- use of `sep = ';'` is important; the same separator must be used when the files are read back.
- filename and path are supplied through variables.
- to set the location of the storage folder, the function `os.path.join(datewise_path, r + '.csv')` is used.

In [19]:
variables = locals()
y = []
for key in collect:
    r = "tweet{0}".format(key)
    variables[r]= tweets2.loc[tweets2['date2'] == key].copy()
    variables[r].to_csv(os.path.join(datewise_path, r + '.csv'), index = False, header = True, encoding = 'utf-8', sep = ';')
    y.append(r)
print(y)

['tweet03062020', 'tweet02062020', 'tweet01062020']
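
As an alternative to writing into `locals()`, the same split can be sketched with `groupby` and an ordinary dictionary. This is a swapped-in technique, not the method used in the cell above; the toy frame and the temporary folder standing in for `datewise_path` are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy data with two distinct dates.
toy = pd.DataFrame({
    "full_text": ["a", "b", "c"],
    "date2": ["03062020", "02062020", "03062020"],
})

with tempfile.TemporaryDirectory() as out_dir:  # stand-in for datewise_path
    frames = {}
    # groupby yields one (date, sub-frame) pair per distinct value of date2.
    for key, group in toy.groupby("date2", sort=False):
        name = "tweet{0}".format(key)
        frames[name] = group.copy()
        group.to_csv(os.path.join(out_dir, name + ".csv"),
                     index=False, header=True, encoding="utf-8", sep=";")
    written = sorted(os.listdir(out_dir))
    print(written)
```

Keeping the frames in a dictionary makes them easy to list and loop over later, without relying on dynamically created variable names.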


## Re-examine the above exercise by reimporting the tweets from multiple files.
#### All files are concatenated only as a means to test the previous steps.

In [20]:
all_files = glob.iglob(os.path.join(datewise_path, "*.csv"))
tweets = pd.concat((pd.read_csv(f, lineterminator='\n', sep = ';') for f in all_files))

##### Examining the length, which is the same as the previous count.

In [21]:
print(len(tweets))

116017


#### Inspecting the tweets by using `head()` function.

In [22]:
tweets.head()

Unnamed: 0,user,reuser,full_text,range,created_at,id,in_reply_to_user_id,user.location,date,date2
0,namjoonayo,cannacae,RT @cannacae: i know BLM is important rn but i...,140,Mon Jun 01 23:59:57 +0000 2020,1267606846824542200,,,2020-06-01 23:59:57+00:00,1062020
1,ngdasokau,BarrySheerman,RT @BarrySheerman: Will the Johnson Government...,140,Mon Jun 01 23:59:57 +0000 2020,1267606845763367000,,Hong Kong,2020-06-01 23:59:57+00:00,1062020
2,geoffreypcdog,benedictrogers,RT @benedictrogers: Seven former foreign secre...,140,Mon Jun 01 23:59:57 +0000 2020,1267606843322331100,,,2020-06-01 23:59:57+00:00,1062020
3,Marisa15061974,WDVL4,RT @WDVL4: China pidió a empresas estatales qu...,140,Mon Jun 01 23:59:56 +0000 2020,1267606839547625500,,,2020-06-01 23:59:56+00:00,1062020
4,Tony13945454,nytimesworld,RT @nytimesworld: The gathering to remember Ti...,140,Mon Jun 01 23:59:55 +0000 2020,1267606835428593700,,Hong Kong,2020-06-01 23:59:55+00:00,1062020


#### Ever since Twitter increased the character limit from 140 to 280, the tweet text is given in `full_text` instead of `text`.
#### One can check where the tweet text lies by using the `isna().sum()` function.
##### checking for any null tweets in `full_text`.

In [23]:
tweets['full_text'].isna().sum()

0

### Exporting `full_text` tweet for further analysis.
##### Extracting the tweet text from `full_text`, together with `range`, into a new dataframe.

In [28]:
tweetstext = tweets.iloc[:,[3, 2]]
tweetstext.head()

Unnamed: 0,range,full_text
0,140,RT @cannacae: i know BLM is important rn but i...
1,140,RT @BarrySheerman: Will the Johnson Government...
2,140,RT @benedictrogers: Seven former foreign secre...
3,140,RT @WDVL4: China pidió a empresas estatales qu...
4,140,RT @nytimesworld: The gathering to remember Ti...


In [29]:
twitext2 = tweetstext.sort_values('range',na_position='first')
twitext2 = twitext2.reset_index(drop=True)
tt = twitext2.iloc[:,[1]]

#### Inspecting the tweet text.

In [30]:
tt.head()

Unnamed: 0,full_text
0,China \nHong Kong https://t.co/hxgmLoEa5T
1,Hong Kong- China https://t.co/awi1qpJJ9I
2,"Hong Kong, China."
3,Hong Kong VS China https://t.co/zh2NWxhZk2
4,CHINA vs HONG KONG


#### Further cross validation of data.

In [31]:
tt.isna().sum()
len(tt)

116017

## Dumping `full_text`.

##### The tweet text is exported to a text file.
##### Folder `processed` is created at the path `process_path`.
##### The raw tweet text is sent to file `process_raw.txt`.
##### The sorted tweet text will be sent to `process_type1.txt`.

In [27]:
process_path = raw_data_path + "/" + 'processed'

if not os.path.exists(process_path):
    os.makedirs(process_path)
    
store_path = os.path.join(process_path, "process_raw.txt")

tt.to_csv(store_path, header = None, index = None, sep='\n', mode='a')

store_file = os.path.join(process_path, 'process_type1.txt')
print(store_file)

/media/ambijat/Datrat/ipytoneeee/CNUS/processed/process_type1.txt


## 4. `full_text` processing by string manipulation and EDA (Exploratory Data Analysis).
#### The `awk`, `sort` and `cut` commands are used to sort tweets according to length.

### Important: The next 2 cells take bash commands. `%%bash` must be the first line of the cell; do not insert any `#` comment line above `%%bash`, as it will give an execution error.
##### checking the file path variable in `bash`.

In [32]:
%%bash -s "$store_path" "$store_file"
echo a = $1, b = $2

a = /media/ambijat/Datrat/ipytoneeee/CNUS/processed/process_raw.txt, b = /media/ambijat/Datrat/ipytoneeee/CNUS/processed/process_type1.txt


#### the `awk '{print length, $0}'` command prefixes each line with its length.
#### Pay attention to the `cut` command in `bash`: here `-d ' '` sets a space as the delimiter and `-f` selects the fields to keep. The term `2-` means from the second field to the end of the line, which removes the length prefix after sorting.
#### The `awk` command at the end is the typical expression `awk '!a[$0]++'`, which is used to remove duplicate lines.
#### Sending the sorted tweet text to `process_type1.txt`.

In [33]:
%%bash -s "$store_path" "$store_file"
echo a = $1, b = $2
cat $1 | awk '{ print length, $0 }' | sort -n -s | cut -d ' ' -f 2- | awk '!a[$0]++' > $2

a = /media/ambijat/Datrat/ipytoneeee/CNUS/processed/process_raw.txt, b = /media/ambijat/Datrat/ipytoneeee/CNUS/processed/process_type1.txt
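
The logic of the pipeline above (sort by length, keep the original order among equal lengths, then drop repeated lines) can be sketched in Python. The function name and sample lines are hypothetical. Note that Python's `len` counts characters, whereas `awk`'s `length` may count bytes depending on the implementation and locale, so the order can differ for non-ASCII text.

```python
def sort_by_length_unique(lines):
    """Sort lines by length (stable) and drop later repeats of the same line."""
    ordered = sorted(lines, key=len)         # sorted() is stable, like `sort -n -s`
    return list(dict.fromkeys(ordered))      # keeps first occurrence, like awk '!a[$0]++'


sample_lines = ["bbb", "a", "cc", "a", "dd"]
result = sort_by_length_unique(sample_lines)
print(result)
```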


### Making files for next stage processing.

In [34]:
store_file_a = os.path.join(process_path, 'process_type3.txt')
print(store_file_a)
store_file_b = os.path.join(process_path, 'process_type3b.txt')
print(store_file_b)

/media/ambijat/Datrat/ipytoneeee/CNUS/processed/process_type3.txt
/media/ambijat/Datrat/ipytoneeee/CNUS/processed/process_type3b.txt


## 5. Extracting `URLS` from tweet text.
#### extraction of urls from the full_text. This is a 2 step process: a) all tweets are broken into word strings and then b) all strings longer than 22 and shorter than 25 characters (that is, 23 or 24 characters) are collected. This is on the presumption that tweet urls are about 23 characters long.
##### The file `process_type3.txt` shall gather all the strings of 23-24 characters. This is the standard twitter `url` length.
### `sed` substitution of various types of strings is based on experience and sampling of text.

In [35]:
%%bash -s "$store_file" "$store_file_a" "$store_file_b"
echo $1
echo $2
echo $3
cat $1 | sed 's/[[:space:]]/\n/g' | sed 's/^[ \t]*//' | sed 's/[ \t]*$//' | sed 's/^["]*//g' |\
sed 's/["]$//g'| sed '/^$/d'| sed 's/["]//g' | awk '{ if (length($0) > 22) print }' |\
awk '{ if (length($0) < 25) print }' >> $2

/media/ambijat/Datrat/ipytoneeee/CNUS/processed/process_type1.txt
/media/ambijat/Datrat/ipytoneeee/CNUS/processed/process_type3.txt
/media/ambijat/Datrat/ipytoneeee/CNUS/processed/process_type3b.txt


#### The length-filtered strings in the above file are then searched for those that start with `http`, and these are exported to the `process_type3b.txt` file.
#### After filtering by character length, the processing is done according to `http` identification.

In [36]:
%%bash -s "$store_file" "$store_file_a" "$store_file_b"
echo $1
echo $2
echo $3
cat $2 | sed 's/["]*$//g' | awk '/^http/{print}' | awk '{ if (length($0) > 22) print }' |\
awk '{ if (length($0) < 24) print }'| sort -u | awk -v n=15 '1; NR % n == 0 {print ""}' >> $3

/media/ambijat/Datrat/ipytoneeee/CNUS/processed/process_type1.txt
/media/ambijat/Datrat/ipytoneeee/CNUS/processed/process_type3.txt
/media/ambijat/Datrat/ipytoneeee/CNUS/processed/process_type3b.txt
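
The two bash stages above can be condensed into a short Python sketch: split each tweet on whitespace, strip double quotes, and keep unique tokens that start with `http` and are exactly 23 characters long, which is what the final length tests allow. The function name is hypothetical and the sample lines are illustrative; the blank line after every 15 urls is not reproduced.

```python
def extract_short_urls(lines, url_length=23):
    """Return the sorted unique tokens that start with http and have url_length characters."""
    urls = set()
    for line in lines:
        for token in line.split():            # split on any whitespace, like the first sed
            token = token.replace('"', "")    # drop double quotes anywhere in the token
            if token.startswith("http") and len(token) == url_length:
                urls.add(token)
    return sorted(urls)                       # unique and sorted, like `sort -u`


# Illustrative sample lines in the style of the tweet text.
sample_lines = [
    "China \nHong Kong https://t.co/hxgmLoEa5T",
    'Hong Kong- China "https://t.co/awi1qpJJ9I"',
    "Hong Kong, China.",
]
urls = extract_short_urls(sample_lines)
print(urls)
```

Filtering on length alone is a heuristic; truncated or unusually long links would be missed by both the bash version and this sketch.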


### Converting the unix file to windows format.

In [37]:
win_file = os.path.join(process_path, "win_file.txt")
print(win_file)

/media/ambijat/Datrat/ipytoneeee/CNUS/processed/win_file.txt


### The conversion to the windows version is done here.

In [38]:
%%bash -s "$store_file_b" "$win_file"
echo $1
echo $2
cat $1 | awk 'sub("$", "\r")' >$2

/media/ambijat/Datrat/ipytoneeee/CNUS/processed/process_type3b.txt
/media/ambijat/Datrat/ipytoneeee/CNUS/processed/win_file.txt
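
The same line-ending conversion can be sketched in Python by letting `open` translate `\n` into `\r\n` on writing. The helper name and the temporary file are hypothetical.

```python
import os
import tempfile


def write_with_crlf(lines, path):
    """Write each line followed by a Windows line ending (CR LF)."""
    # newline="\r\n" makes Python translate every "\n" written into "\r\n".
    with open(path, "w", encoding="utf-8", newline="\r\n") as handle:
        for line in lines:
            handle.write(line + "\n")


with tempfile.TemporaryDirectory() as demo_dir:
    demo_file = os.path.join(demo_dir, "win_file.txt")
    write_with_crlf(["https://t.co/hxgmLoEa5T", ""], demo_file)
    with open(demo_file, "rb") as handle:     # read raw bytes to inspect the endings
        raw_bytes = handle.read()
    print(raw_bytes)
```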


#### Rearranging dates into `yyyymmdd` format in new column `date3`.
#### The date parts are extracted by string slicing.

In [39]:
tweets2['date3'] = tweets2.date2.str[4:8]+tweets2.date2.str[2:4]+tweets2.date2.str[0:2]
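
The slicing above can be checked against formatting the datetime column directly with `strftime('%Y%m%d')`. A small sketch on two hypothetical timestamps:

```python
import pandas as pd

# Two hypothetical timestamps in the style of the `date` column.
dates = pd.Series(pd.to_datetime(["2020-06-03 16:54:54", "2020-06-01 23:59:57"], utc=True))

date2 = dates.dt.strftime("%d%m%Y")                                   # e.g. 03062020
date3_sliced = date2.str[4:8] + date2.str[2:4] + date2.str[0:2]       # year + month + day
date3_direct = dates.dt.strftime("%Y%m%d")                            # same result in one step

print(date3_direct.tolist())
```

Strings in year-month-day order sort chronologically, which is handy when the dates are used as labels or file names.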

#### Inspecting the `date3` in dataframe `tweets2`.

In [40]:
tweets2.head()

Unnamed: 0,user,reuser,full_text,range,created_at,id,in_reply_to_user_id,user.location,date,date2,date3
0,TomWong93767868,SkyNews,RT @SkyNews: The PM has pledged to allow nearl...,139,Wed Jun 03 16:54:54 +0000 2020,1268224655061479400,,,2020-06-03 16:54:54+00:00,3062020,20200603
1,98h33654581,HongKongFP,RT @HongKongFP: [Recap] Pompeo says China is t...,135,Wed Jun 03 16:54:54 +0000 2020,1268224652993630200,,,2020-06-03 16:54:54+00:00,3062020,20200603
2,tckj725,Reuters,RT @Reuters: HSBC says it supports China's sec...,117,Wed Jun 03 16:54:53 +0000 2020,1268224650112200700,,Australia、Hong kong,2020-06-03 16:54:53+00:00,3062020,20200603
3,kizza_marxel,nytimes,RT @nytimes: Breaking News: Prime Minister Bor...,140,Wed Jun 03 16:54:51 +0000 2020,1268224642839240700,,,2020-06-03 16:54:51+00:00,3062020,20200603
4,PilotAbilene,Steinhoefel,"RT @Steinhoefel: Einwanderung, aber richtig. I...",140,Wed Jun 03 16:54:51 +0000 2020,1268224642390593500,,Shermany,2020-06-03 16:54:51+00:00,3062020,20200603


#### Making new folder to store the `tweets2`.

In [41]:
process2_path = raw_data_path + "/" + 'processed2'
print(process2_path)
if not os.path.exists(process2_path):
    os.makedirs(process2_path)

/media/ambijat/Datrat/ipytoneeee/CNUS/processed2


#### Exporting the tweets.

In [42]:
tweets2.to_csv(os.path.join(process2_path, 'tweets2.csv'), index = False, header = True, encoding = 'utf-8', sep = ';')

### storing the dataframe to be used in another notebook.
'''
The tweets2 dataframe can be used for processing the data in R, as the tweets and date3 together give everything needed for such processing.
The process_type3b file has all the urls. One can find the most tweeted urls in the next stage.
'''

In [43]:
%store tweets2

Stored 'tweets2' (DataFrame)


In [44]:
%%bash -s "$store_path" "$store_file_a" "$store_file_b"
echo $1
echo $2
echo $3
rm $2
rm $3

/media/ambijat/Datrat/ipytoneeee/CNUS/processed/process_raw.txt
/media/ambijat/Datrat/ipytoneeee/CNUS/processed/process_type3.txt
/media/ambijat/Datrat/ipytoneeee/CNUS/processed/process_type3b.txt
