# 03. Pre-processing and EDA
---

The purpose of this notebook is to prepare our web-scraped data from the file [scraped_tweets.csv](#http://localhost:8888/edit/PROJECTS/project_5/project_5/datasets/scraped_tweets.csv) for future modeling.

---
# Table of Contents #

- [1. Importing Libraries and Data](#1.-Importing-Libraries-and-Data)
- [2. Pre-processing](#2.-Pre-processing)
- [3. EDA](#3.-EDA)

---
## 1. Importing Libraries and Data ##

In [2]:
#Importing Libraries
import pandas as pd
import numpy as np

In [3]:
#Importing our data to a DataFrame
df = pd.read_csv('../datasets/clean_twitter.csv')

---
## 2. Pre-processing ##

In [4]:
#Getting data dimensions
df.shape

(8015, 4)

In [5]:
#Initial data check
df.head()

Unnamed: 0.1,Unnamed: 0,username,tweet,date_posted
0,0,511njbt,Delays on George Washington Bridge westbound f...,2019-11-06 23:59:56+00:00
1,1,511njbt,Delays on George Washington Bridge westbound f...,2019-11-06 23:58:57+00:00
2,2,511njtpk,Crash on New Jersey Turnpike - Eastern Spur so...,2019-11-06 23:58:56+00:00
3,3,511nji295,Crash on I-295 southbound South of Exit 29 - U...,2019-11-06 23:56:56+00:00
4,4,511njace,"Construction, bridge painting on Atlantic City...",2019-11-06 23:52:57+00:00


To determine which tweets are relevant to our analysis, we filter them by keywords; see the details below.

In [6]:
#Creating individual masks to filter our DataFrame

mask_road = df['tweet'].str.contains('road', regex=False, case=False)
mask_exit = df['tweet'].str.contains('exit', regex=False, case=False)
mask_street = df['tweet'].str.contains('street', regex=False, case=False)
mask_highway = df['tweet'].str.contains('highway', regex=False, case=False)
mask_hwy = df['tweet'].str.contains('hwy', regex=False, case=False)
mask_ramp = df['tweet'].str.contains('ramp', regex=False, case=False)
mask_st = df['tweet'].str.contains('st', regex=False, case=False)
mask_ave = df['tweet'].str.contains('ave', regex=False, case=False)
mask_lane = df['tweet'].str.contains('lane', regex=False, case=False)
mask_ln = df['tweet'].str.contains('ln', regex=False, case=False)
mask_drive = df['tweet'].str.contains('drive', regex=False, case=False)
mask_clos = df['tweet'].str.contains('clos', regex=False, case=False)

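#Creating a unified mask: a tweet must contain a road-related keyword AND 'clos'
#Note: mask_ave, mask_ln and mask_drive are defined above but are not part of this mask
mask = (mask_road |
        mask_exit |
        mask_street |
        mask_highway |
        mask_hwy |
        mask_ramp |
        mask_st |
        mask_lane) & mask_clos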
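
Plain substring matching is broad: with `regex=False, case=False`, a short keyword such as `'st'` also matches inside longer words like "westbound". The sketch below uses a few made-up tweets (hypothetical data, not taken from the dataset) to show this effect, together with one possible stricter alternative based on word boundaries. The helper name `keyword_mask` is an assumption introduced for illustration only and is not part of the notebook.

```python
import re
import pandas as pd

# Hypothetical sample tweets, made up for illustration only
sample = pd.Series([
    "Delays on bridge westbound due to volume",
    "Main St closed between 1st and 2nd",
    "All lanes closed on the ramp",
])

# Substring match, as in the masks above: 'st' also hits 'westbound'
substring_hits = sample.str.contains('st', regex=False, case=False)

def keyword_mask(texts, keywords):
    """Return True where any keyword appears as a whole word (case-insensitive)."""
    # Non-capturing group with word boundaries; keywords are escaped literally
    pattern = r'\b(?:' + '|'.join(re.escape(k) for k in keywords) + r')\b'
    return texts.str.contains(pattern, regex=True, case=False)

whole_word_hits = keyword_mask(sample, ['st'])

print(substring_hits.tolist())
print(whole_word_hits.tolist())
```

Whether the broader or the stricter match is preferable depends on how many false positives the later modeling step can tolerate.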

In [7]:
#Separating tweets dealing with actual road closures into a separate DataFrame
df_closures = df[mask]

In [8]:
#Separating tweets NOT dealing with actual road closures into a separate DataFrame
df_spam = df[~mask]
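
Because `df[mask]` and `df[~mask]` are complementary, every row lands in exactly one of the two DataFrames. The minimal sketch below, on made-up data (the toy frame and its values are assumptions), checks this and takes explicit copies with `.copy()`, one common way to make later column assignments on the subsets unambiguous for pandas.

```python
import pandas as pd

# Hypothetical toy tweets, for illustration only
toy = pd.DataFrame({'tweet': ['Exit 4 ramp closed',
                              'Delays on Route 1',
                              'Main Street closed',
                              'Bus 402 is late']})

toy_mask = toy['tweet'].str.contains('clos', regex=False, case=False)

# Explicit copies, so that adding a column later modifies independent objects
toy_closures = toy[toy_mask].copy()
toy_other = toy[~toy_mask].copy()

# The two subsets partition the original rows
assert len(toy_closures) + len(toy_other) == len(toy)

toy_closures['mark'] = 0
toy_other['mark'] = 1
print(len(toy_closures), len(toy_other))
```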

---
## 3. EDA ##

In [9]:
#Getting data dimensions for the closures DataFrame
df_closures.shape

(1622, 4)

In [10]:
#Checking for empty tweets
df_closures['tweet'].isna().sum()

0

In [11]:
#Getting data dimensions for the spam DataFrame
df_spam.shape

(6393, 4)

In [12]:
#Checking for empty tweets
df_spam['tweet'].isna().sum()

0

In [13]:
#Introducing a label for our future positive class
#(assign returns a new DataFrame, which avoids pandas' SettingWithCopyWarning)
df_closures = df_closures.assign(mark=0)


In [14]:
#Introducing a label for our future negative class
#(assign returns a new DataFrame, which avoids pandas' SettingWithCopyWarning)
df_spam = df_spam.assign(mark=1)


In [15]:
#Having a look at our spam DataFrame
df_spam

Unnamed: 0.1,Unnamed: 0,username,tweet,date_posted,mark
0,0,511njbt,Delays on George Washington Bridge westbound f...,2019-11-06 23:59:56+00:00,1
1,1,511njbt,Delays on George Washington Bridge westbound f...,2019-11-06 23:58:57+00:00,1
2,2,511njtpk,Crash on New Jersey Turnpike - Eastern Spur so...,2019-11-06 23:58:56+00:00,1
5,5,511nji287,Delays on I-287 southbound between Exit 6 - CR...,2019-11-06 23:52:56+00:00,1
6,6,511nji287,Delays on I-287 northbound between Exit 37 - N...,2019-11-06 23:51:56+00:00,1
...,...,...,...,...,...
8008,8008,NJTRANSIT_SBUS,"Bus Route No. 402, the 10:29 pm arrival into B...",2019-10-24 01:05:33+00:00,1
8009,8009,NJTRANSIT,"Good evening, the 167t to Harrington Park will...",2019-10-24 01:02:41+00:00,1
8010,8010,NJTRANSIT,"Hi Rachel, thank you for alerting us. I have f...",2019-10-24 00:57:28+00:00,1
8012,8012,NJTRANSIT_NBUS,"Bus Route No. 70, the 9:30 pm arrival into NPS...",2019-10-24 00:29:57+00:00,1


In [16]:
#Having a look at our road closures tweets DataFrame
df_closures.head()

Unnamed: 0.1,Unnamed: 0,username,tweet,date_posted,mark
3,3,511nji295,Crash on I-295 southbound South of Exit 29 - U...,2019-11-06 23:56:56+00:00,0
4,4,511njace,"Construction, bridge painting on Atlantic City...",2019-11-06 23:52:57+00:00,0
8,8,511njtpk,Roadwork on New Jersey Turnpike inner roadway ...,2019-11-06 23:41:56+00:00,0
9,9,511nji76,Crash on I-76 eastbound at Exit 2 - I-676 (Cam...,2019-11-06 23:41:56+00:00,0
19,19,511njtpk,Roadwork on New Jersey Turnpike inner roadway ...,2019-11-06 23:27:56+00:00,0


In [17]:
#Dropping unnecessary and uninformative columns
df_closures = df_closures[['tweet','mark']]
df_spam = df_spam[['tweet','mark']]

#Resetting indices
df_closures.reset_index(inplace=True)
df_spam.reset_index(inplace=True)
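
Note that `reset_index` without `drop=True` keeps the old row labels as an extra column named `index`, so the resulting DataFrames have three columns rather than two. A small sketch on made-up data (the toy frame is an assumption) shows the difference:

```python
import pandas as pd

# Hypothetical toy frame with non-contiguous row labels, as after filtering
toy = pd.DataFrame({'tweet': ['a', 'b', 'c'], 'mark': [0, 0, 0]},
                   index=[3, 4, 8])

kept = toy.reset_index()              # old labels become an 'index' column
dropped = toy.reset_index(drop=True)  # old labels are discarded

print(kept.columns.tolist())
print(dropped.columns.tolist())
```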

In [18]:
#Getting an impression of a spam tweet
df_spam['tweet'][5]

'Delays on I-295 southbound from Exit 29 - US 30 (Barrington) to Exit 26S - NJ 42/I-76/I-676 (Bellmawr) delays due to volume   '

In [19]:
#Getting an impression of a closures tweet
df_closures['tweet'][5]

'Roadwork on New Jersey Turnpike inner roadway Northbound between Inner and Outer Roadway Merge (Mansfield Twp) and North of Interchange 14 - I-78/US 1&9 (Newark) all lanes closed until 5:00 A.M.   '

To avoid imbalanced classes in our future classification model, we construct the modeling dataset so that the positive class (field 'mark' = 0, tweets about closed road sections) is not diluted by the much larger negative class (field 'mark' = 1, tweets about no closures or only partial closures). The idea is that for emergency planning it matters whether a first-response vehicle can use at least part of the road. Hence, we randomly draw tweets from our negative 'spam' DataFrame, as many as there are tweets in our positive DataFrame.

In [20]:
#Randomly sampling our negative tweets into a new 'Negative' DataFrame
df_neg = df_spam.sample(df_closures.shape[0])

#Concatenating our positive and negative dataframes into the final one
df_final = pd.concat([df_closures, df_neg])

#Checking the final DataFrame dimensions
df_final.shape

(3244, 3)
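
Without a seed, `sample` draws a different subset on every run, so the modeling dataset changes between executions. The sketch below, on made-up data, shows one possible way to make the undersampling repeatable by passing `random_state`. The seed value 42 and the toy frames are assumptions for illustration, not values used in this notebook.

```python
import pandas as pd

# Hypothetical toy data: 3 "closure" rows (mark = 0) and 10 other rows (mark = 1)
toy_pos = pd.DataFrame({'tweet': [f'closure {i}' for i in range(3)], 'mark': 0})
toy_neg = pd.DataFrame({'tweet': [f'other {i}' for i in range(10)], 'mark': 1})

# Draw as many negative rows as there are positive rows;
# random_state (an assumed seed) makes the draw repeatable
toy_neg_sample = toy_neg.sample(n=len(toy_pos), random_state=42)

# Stack both classes and renumber the rows
toy_balanced = pd.concat([toy_pos, toy_neg_sample], ignore_index=True)

print(toy_balanced['mark'].value_counts().to_dict())
```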

In [21]:
#Saving our final DataFrame into a csv file for future modeling
df_final.to_csv('../datasets/modeling_df.csv')
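
By default `to_csv` also writes the row index, and reading such a file back produces an extra `Unnamed: 0` column like the ones visible in the data loaded at the top of this notebook. A minimal sketch on made-up data, using an in-memory buffer instead of a real file (both are assumptions for illustration), compares the default with `index=False`:

```python
import io
import pandas as pd

# Hypothetical toy frame, for illustration only
toy = pd.DataFrame({'tweet': ['a', 'b'], 'mark': [0, 1]})

# Default: the index is written as an unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False: only the data columns are written
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```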

---