# 02. Twitter Text Cleaning

---

<h1>Table of Contents<span class="tocSkip"></span></h1>

- [1. Import Packages](#1.-Import-Packages)
- [2. Read in Twitter Data](#2.-Read-in-Twitter-Data)
- [3. Frame Cleaning](#3.-Frame-Cleaning)
- [4. Text Cleaning](#4.-Text-Cleaning)
 - [4A. Initial HTML Cleaning](#4A.-Initial-HTML-Cleaning)
 - [4B. Links Cleaning](#4B.-Links-Cleaning)
 - [4C. Additional HTML and Other Text Cleaning](#4C.-Additional-HTML-and-Other-Text-Cleaning)
- [5. Read to CSV](#5.-Read-to-CSV)

---

# 1. Import Packages

In [1]:
import pandas as pd
import numpy as np
import requests
from nltk.stem import WordNetLemmatizer
import nltk
nltk.download('wordnet')
nltk.download('stopwords')
from nltk.tokenize import RegexpTokenizer
import regex as re
from nltk.stem.porter import PorterStemmer
from bs4 import BeautifulSoup  
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Fausto\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Fausto\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


---

# 2. Read in Twitter Data

Read in a CSV file of scraped Twitter data, which we will then clean. To use your own file, replace the path string passed to `pd.read_csv` (currently `'../datasets/scraped_tweets.csv'`).

In [2]:
data = pd.read_csv('../datasets/scraped_tweets.csv')

In [3]:
data.head()

| | Unnamed: 0 | username | tweet | date_posted |
|---|---|---|---|---|
| 0 | 0 | 511njbt | Delays on George Washington Bridge westbound f... | 2019-11-06 23:59:56+00:00 |
| 1 | 1 | 511njbt | Delays on George Washington Bridge westbound f... | 2019-11-06 23:58:57+00:00 |
| 2 | 2 | 511njtpk | Crash on New Jersey Turnpike - Eastern Spur so... | 2019-11-06 23:58:56+00:00 |
| 3 | 3 | 511nji295 | Crash on I-295 southbound South of Exit 29 - U... | 2019-11-06 23:56:56+00:00 |
| 4 | 4 | 511njace | Construction, bridge painting on Atlantic City... | 2019-11-06 23:52:57+00:00 |


---

# 3. Frame Cleaning

In [4]:
data.drop(columns = 'Unnamed: 0', inplace = True)

In [5]:
data.head()

| | username | tweet | date_posted |
|---|---|---|---|
| 0 | 511njbt | Delays on George Washington Bridge westbound f... | 2019-11-06 23:59:56+00:00 |
| 1 | 511njbt | Delays on George Washington Bridge westbound f... | 2019-11-06 23:58:57+00:00 |
| 2 | 511njtpk | Crash on New Jersey Turnpike - Eastern Spur so... | 2019-11-06 23:58:56+00:00 |
| 3 | 511nji295 | Crash on I-295 southbound South of Exit 29 - U... | 2019-11-06 23:56:56+00:00 |
| 4 | 511njace | Construction, bridge painting on Atlantic City... | 2019-11-06 23:52:57+00:00 |


In [6]:
data.dtypes

username       object
tweet          object
date_posted    object
dtype: object

In [7]:
data['date_posted'] = pd.to_datetime(data['date_posted'])

In [8]:
data.dtypes

username                    object
tweet                       object
date_posted    datetime64[ns, UTC]
dtype: object

---

# 4. Text Cleaning

## 4A. Initial HTML Cleaning

First, we pass each tweet through Beautiful Soup to begin clearing out leftover HTML characters.

In [9]:
# parse each tweet with Beautiful Soup and keep only its text content
data['tweet'] = data['tweet'].map(lambda text: BeautifulSoup(text, 'lxml').text)
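
To see the kind of residue this step targets, the sketch below decodes HTML entities in a made-up tweet using the standard library's `html.unescape`. This is a dependency-free stand-in for illustration only, not the Beautiful Soup call used above; the sample string is hypothetical, and unlike Beautiful Soup's `.text`, `html.unescape` does not strip tags.

```python
import html

# Hypothetical tweet text containing HTML entities (not taken from the dataset)
sample = 'Delays on I-78 &amp; Route 22 &gt; 20 min'

# Decode entities such as &amp; and &gt; back into plain characters
decoded = html.unescape(sample)
print(decoded)
```

Note that decoding can leave literal characters such as `>` in the text, which is one reason a later pass over leftover characters may still be needed.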

In [10]:
data.head()

| | username | tweet | date_posted |
|---|---|---|---|
| 0 | 511njbt | Delays on George Washington Bridge westbound f... | 2019-11-06 23:59:56+00:00 |
| 1 | 511njbt | Delays on George Washington Bridge westbound f... | 2019-11-06 23:58:57+00:00 |
| 2 | 511njtpk | Crash on New Jersey Turnpike - Eastern Spur so... | 2019-11-06 23:58:56+00:00 |
| 3 | 511nji295 | Crash on I-295 southbound South of Exit 29 - U... | 2019-11-06 23:56:56+00:00 |
| 4 | 511njace | Construction, bridge painting on Atlantic City... | 2019-11-06 23:52:57+00:00 |


## 4B. Links Cleaning

Next, we remove all links and @-mentions from the text.

In [11]:
# remove links (anything beginning with http) from each tweet
data['tweet'] = data.loc[:, 'tweet'].map(lambda row : re.sub(r'http\S+', '', row))
# remove @-mentions (an @ followed by any run of non-whitespace characters)
data['tweet'] = data.loc[:, 'tweet'].map(lambda row : re.sub(r'@\S+', '', row))
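
As a quick check of what the two substitutions above do, here is a minimal sketch applied to a single made-up string. It uses the standard library `re` module rather than the third-party `regex` package imported earlier, on the assumption that these two simple patterns behave the same in both. The helper name `strip_links_and_mentions` and the sample text are hypothetical.

```python
import re

def strip_links_and_mentions(text):
    """Remove http(s) links, then @-mentions, from a single string."""
    text = re.sub(r'http\S+', '', text)   # drop anything starting with http
    text = re.sub(r'@\S+', '', text)      # drop @ plus the following non-space run
    return text

# Hypothetical tweet (not taken from the dataset)
sample = 'Crash on I-295 southbound @511njbt details at https://example.com/abc now'
cleaned = strip_links_and_mentions(sample)

# Removing tokens leaves doubled spaces behind; collapsing them is optional
collapsed = ' '.join(cleaned.split())
print(repr(cleaned))
print(repr(collapsed))
```

Because `@\S+` matches any `@` followed by non-whitespace, it would also clip the domain off an email address, so it may be worth spot-checking tweets that contain `@` in other contexts.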

## 4C. Additional HTML and Other Text Cleaning

Although Beautiful Soup gave the HTML characters a helpful first pass, inspecting the text shows that some characters are still left over. Some of these could simply be deleted, but we replace each one with a space instead so that the original spacing between words is maintained.

In [12]:
# list of leftover HTML characters and other tokens to replace with spaces
rem_chars = ['\n', '\ufeff', '>', '**', '\'ve', '#', '…']

In [13]:
for char in rem_chars:
    data['tweet'] = data.loc[:, 'tweet'].map(lambda text : text.replace(char, ' '))
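
To illustrate why spaces are used instead of empty strings, here is a small sketch that applies the same replacement list to one made-up string. The helper name `replace_leftovers` and the sample text are hypothetical, and the final whitespace-collapsing step is an optional addition that the notebook itself does not perform.

```python
# Same tokens as rem_chars above
rem_chars = ['\n', '\ufeff', '>', '**', "'ve", '#', '…']

def replace_leftovers(text, chars=rem_chars):
    """Replace each leftover token with a single space."""
    for char in chars:
        text = text.replace(char, ' ')
    return text

# Hypothetical tweet (not taken from the dataset)
sample = 'Crash on #I295\n**UPDATE** lanes reopened…'
cleaned = replace_leftovers(sample)

# Optional: collapse the runs of spaces that the replacements leave behind
normalized = ' '.join(cleaned.split())
print(repr(cleaned))
print(repr(normalized))
```

Had the newline been replaced with an empty string, `#I295` and `**UPDATE**` would have been fused into one token; substituting a space keeps them as separate words.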

---

# 5. Read to CSV

In [14]:
data.to_csv('../datasets/clean_twitter.csv')

---