# Data Wrangling 

This notebook is for the second capstone project. The goal is to predict, from the text alone, whether a tweet is about an actual disaster. The data comes from this Kaggle challenge: https://www.kaggle.com/c/nlp-getting-started/overview

The datasets are provided, but need cleaning.

In [1]:
#imports

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [16]:
#raw data 
test_data = pd.read_csv('C:/Users/bourg/OneDrive/Documents/springboard/NLP getting started/test.csv')
train_data = pd.read_csv('C:/Users/bourg/OneDrive/Documents/springboard/NLP getting started/train.csv')

In [17]:
#first compare the two
train_data.info()
test_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7613 entries, 0 to 7612
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        7613 non-null   int64 
 1   keyword   7552 non-null   object
 2   location  5080 non-null   object
 3   text      7613 non-null   object
 4   target    7613 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 297.5+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3263 entries, 0 to 3262
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        3263 non-null   int64 
 1   keyword   3237 non-null   object
 2   location  2158 non-null   object
 3   text      3263 non-null   object
dtypes: int64(1), object(3)
memory usage: 102.1+ KB


I was already aware that train_data had an extra column, target. It holds the label for each tweet: 1 if the text is about a disaster, 0 if it is not. The training data contains this label, and the test data does not yet have it.

The 'id' column has no missing values, so its count gives the true number of rows in each file. Measured against that count, both files are missing a lot of location data and a few keywords. I am not sure at this point how important those fields are.

In [18]:
train_data.head()
print(train_data.columns)


Index(['id', 'keyword', 'location', 'text', 'target'], dtype='object')


In [19]:
train_data.describe()



                id   target
count       7613.0   7613.0
mean   5441.934848  0.42966
std     3137.11609  0.49506
min            1.0      0.0
25%         2734.0      0.0
50%         5408.0      0.0
75%         8146.0      1.0
max        10873.0      1.0


In [20]:
train_data.dtypes

id           int64
keyword     object
location    object
text        object
target       int64
dtype: object

In [21]:
train_data.isnull().sum()


id             0
keyword       61
location    2533
text           0
target         0
dtype: int64

In [22]:
#share of rows with a missing location / keyword
n_rows = len(train_data)
missing_locs = train_data['location'].isna().sum()
missing_key = train_data['keyword'].isna().sum()
loc_na = missing_locs/n_rows
key_na = missing_key/n_rows
print(loc_na)
print(key_na)

0.33272034677525286
0.008012610009194798


Over 33% of the location data is missing; less than one percent of the keywords are missing. At this point, the gaps need to be either filled in with placeholder data or dropped.
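As a cross-check on the manual division above, the same shares can be read off for every column at once. This is a minimal sketch on a made-up toy frame (the rows are illustrative, not taken from the dataset); it relies on the fact that the mean of a boolean column is the fraction of True values.

```python
import pandas as pd

# Toy frame standing in for train_data (values are made up for illustration)
df = pd.DataFrame({
    "keyword": ["ablaze", None, "wrecked", "ablaze"],
    "location": ["London, UK", None, None, "AFRICA"],
    "text": ["a", "b", "c", "d"],
})

# isna() gives a boolean frame; the mean of booleans is the share of True values
missing_share = df.isna().mean()
print(missing_share)
```

On the real data, `train_data.isna().mean()` should give the same two fractions printed above, without hardcoding the row count.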

I will drop the rows with missing keywords, because there are so few. I will leave the missing locations as NaN (pandas already reads the empty fields that way) and try to proceed from there, editing if need be. I want to avoid dropping those rows if possible because they are such a large chunk of the data, but for the same reason, I don't want to extrapolate data.

# Dropping rows

In [29]:

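#drop the rows that have no keyword
train_data.dropna(subset=['keyword'], inplace=True)
#note: without parentheses this shows the bound method and the frame's repr;
#train_data.info() would print the column summary instead
train_data.info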



<bound method DataFrame.info of          id  keyword                       location  \
31       48   ablaze                     Birmingham   
32       49   ablaze  Est. September 2012 - Bristol   
33       50   ablaze                         AFRICA   
34       52   ablaze               Philadelphia, PA   
35       53   ablaze                     London, UK   
...     ...      ...                            ...   
7578  10830  wrecked                            NaN   
7579  10831  wrecked              Vancouver, Canada   
7580  10832  wrecked                        London    
7581  10833  wrecked                        Lincoln   
7582  10834  wrecked                            NaN   

                                                   text  target  
31    @bbcmtd Wholesale Markets ablaze http://t.co/l...       1  
32    We always try to bring the heavy. #metal #RT h...       0  
33    #AFRICANBAZE: Breaking news:Nigeria flag set a...       1  
34                   Crying out for more! S

# Checking for duplicates


In [None]:
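#flag rows that are exact copies of an earlier row (all columns compared)
duplicates = train_data.duplicated()
#duplicates.to_csv(r'C:/Users/bourg/OneDrive/Documents/springboard/duplicates.csv')

#there are no duplicates; print the count of flagged rows rather than the whole Series
print(duplicates.sum())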
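One caveat with the check above: `duplicated()` compares whole rows, so if `id` is unique per row (as it appears to be), no row can ever be flagged, even when the tweet text repeats. The sketch below uses a made-up toy frame to show the difference between a whole-row check and a check on the `text` column only. It makes no claim about what the real data contains.

```python
import pandas as pd

# Toy frame (made-up rows): ids are unique, but two rows share the same text
df = pd.DataFrame({
    "id": [1, 2, 3],
    "text": ["forest fire near town", "forest fire near town", "what a day"],
    "target": [1, 1, 0],
})

# Whole-row comparison: the unique id column means no row can match another
full_row_dupes = df.duplicated().sum()

# Comparing only the text column finds the repeated tweet
text_dupes = df.duplicated(subset=["text"]).sum()

print(full_row_dupes, text_dupes)
```

Running `train_data.duplicated(subset=['text']).sum()` would be the equivalent check on the real file.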

# Cleaning

In [None]:
#delete every character that is not a digit, ASCII letter, colon, comma, or whitespace
train_data['text'] = train_data['text'].replace(r'[^0-9a-zA-Z:,\s]', '', regex=True)

#strip leading/trailing whitespace and lowercase each column
train_data['text'] = train_data['text'].str.strip().str.lower()


train_data['keyword'] = train_data['keyword'].str.strip().str.lower()
train_data['location'] = train_data['location'].str.strip().str.lower()
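To see what the regular expression in the cleaning step actually keeps and removes, here is a small standalone sketch using Python's `re` module with the same character class. The helper name `clean_text` and the sample tweet are made up for illustration. Note two side effects: URLs collapse into a single token, and non-ASCII letters are dropped.

```python
import re

# Same character class as the cleaning step above:
# keep digits, ASCII letters, colons, commas and whitespace; drop everything else
PATTERN = re.compile(r"[^0-9a-zA-Z:,\s]")

def clean_text(raw: str) -> str:
    """Drop disallowed characters, then trim and lowercase."""
    return PATTERN.sub("", raw).strip().lower()

# Made-up sample tweet
sample = " @user Fire at #Main St., see http://t.co/abc! "
print(clean_text(sample))
```

If the collapsed URL tokens turn out to be noise, one option would be to strip links before applying the pattern; that is a design choice left open here.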

In [None]:
train_data['text']

# Stop words

I am considering removing stop words using NLTK. For now I am leaving them in unless I find a compelling reason to remove them. I don't yet know what the next steps in my process will be.
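For reference, stop word removal itself is a simple filter. The sketch below uses a tiny hand-picked stop list and a hypothetical helper `remove_stop_words`, purely to show the mechanics. It is not NLTK's list; with NLTK, the list would come from `nltk.corpus.stopwords.words('english')` after downloading the `stopwords` corpus.

```python
# Tiny hand-picked stop list, for illustration only (not NLTK's list)
STOP_WORDS = {"a", "an", "the", "is", "at", "in", "of", "and", "are"}

def remove_stop_words(text: str) -> str:
    """Split on whitespace and drop tokens found in STOP_WORDS."""
    kept = [tok for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(kept)

print(remove_stop_words("there is a forest fire at spot pond"))
```

Because the text is already lowercased by the cleaning step, a lowercase stop list is enough for the lookup.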

# Test file

Doing it all again, but with the test file. 

In [31]:
test_data.describe()

                id
count       3263.0
mean   5427.152927
std    3146.427221
min            0.0
25%         2683.0
50%         5500.0
75%         8176.0
max        10875.0


In [32]:
test_data.dtypes

id           int64
keyword     object
location    object
text        object
dtype: object

In [34]:
test_data.isnull().sum()


id             0
keyword       26
location    1105
text           0
dtype: int64

The test data is missing 26 keywords (about 0.8% of its 3263 rows) and 1105 locations (about 34%), which matches the info() output at the top. I am not dropping any rows from the test file: the missing keywords are few, and the rows with missing locations make up too large a share of the data to discard.

In [35]:
#cleaning
#delete every character that is not a digit, ASCII letter, colon, comma, or whitespace
test_data['text'] = test_data['text'].replace(r'[^0-9a-zA-Z:,\s]', '', regex=True)

#strip leading/trailing whitespace and lowercase each column
test_data['text'] = test_data['text'].str.strip().str.lower()


test_data['keyword'] = test_data['keyword'].str.strip().str.lower()
test_data['location'] = test_data['location'].str.strip().str.lower()

test_data.head()

   id keyword location                                                text
0   0     NaN      NaN                  just happened a terrible car crash
1   2     NaN      NaN  heard about earthquake is different cities, st...
2   3     NaN      NaN   there is a forest fire at spot pond, geese are...
3   9     NaN      NaN               apocalypse lighting spokane wildfires
4  11     NaN      NaN       typhoon soudelor kills 28 in china and taiwan
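Since the same cleaning steps are applied to both the training and the test file, one option is to wrap them in a single function so the two files cannot drift apart. This is a minimal sketch: `clean_frame` is a hypothetical helper name, the toy rows are made up, and it uses `Series.str.replace` with `regex=True` in place of `Series.replace`, which should behave the same for this pattern.

```python
import pandas as pd

def clean_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy: same regex, strip and lowercase steps as above."""
    out = df.copy()
    out["text"] = out["text"].str.replace(r"[^0-9a-zA-Z:,\s]", "", regex=True)
    for col in ["text", "keyword", "location"]:
        # .str methods leave missing values as missing
        out[col] = out[col].str.strip().str.lower()
    return out

# Toy frame with made-up rows
toy = pd.DataFrame({
    "id": [0, 2],
    "keyword": [None, "Ablaze "],
    "location": [None, " London, UK"],
    "text": ["Just happened: a terrible car crash!", "  Forest FIRE near #town "],
})
cleaned = clean_frame(toy)
print(cleaned)
```

Usage on the real data would be `train_data = clean_frame(train_data)` and `test_data = clean_frame(test_data)`. Returning a copy keeps the raw frames available for comparison.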


# Export files

In [36]:
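#index=False keeps the row index out of the file, so reloading does not add an 'Unnamed: 0' column
test_data.to_csv(r'C:/Users/bourg/OneDrive/Documents/springboard/NLP getting started/cleaned/test_data_clean.csv', index=False)
train_data.to_csv(r'C:/Users/bourg/OneDrive/Documents/springboard/NLP getting started/cleaned/train_data_clean.csv', index=False)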