# Text Preprocessing
## CSSM530: Automated Text Processing for Social Sciences
### Table of Contents:

1. [Text Preprocessing](#text_preproc)
    - [Removing punctuation](#remove_punc)
    - [Removing URLs](#remove_url)
    - [Lower Casing](#lower)
    - [Tokenization](#tokenize)
    - [Removing Stop Words](#remove_stop)
    - [Removing Emoji](#remove_emoji)
    - [Stemming & Lemmatization](#stem_lemma)
2. [Converting Text to Numbers (Text Vectorization)](#vectorization)
    - [Bag-of-Words](#bow)
    - [Term Frequency–Inverse Document Frequency (TF-IDF)](#tfidf)
    - [Building a Simple Classifier](#classifier)

## 0- Discussion

- Preprocessing $\neq$ removing
    - In many cases, you may want to keep or replace stop words, URLs, punctuation, and emojis rather than delete them. For example:
        - URLs can be replaced with standard tokens like \<URL\>
        - Instead of removing hashtags or out-of-vocabulary words, you may use sequence analysis algorithms to find the most likely combination of words: #whataday &#8594; what a day
- Preprocessing is highly context-dependent
    - The methods you use will change dramatically depending on your use case!
    - Therefore, you have to decide what features to remove, replace or keep depending on your research.
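
As a minimal sketch of the "replace rather than remove" idea, the snippet below swaps URLs and mentions for placeholder tokens and splits a hashtag into words with a small dynamic-programming search. The names `normalize_tweet` and `segment_hashtag`, the `<USER>` token, and the toy vocabulary are illustrative assumptions; a realistic segmenter would rely on a large word list with frequencies.

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")


def normalize_tweet(text):
    """Replace URLs and @mentions with placeholder tokens instead of deleting them."""
    text = URL_RE.sub("<URL>", text)
    return MENTION_RE.sub("<USER>", text)


def segment_hashtag(tag, vocab):
    """Split a hashtag into vocabulary words, preferring the split with the fewest words."""
    word = tag.lstrip("#").lower()
    # best[i] holds the shortest list of vocabulary words spelling word[:i], or None
    best = [None] * (len(word) + 1)
    best[0] = []
    for end in range(1, len(word) + 1):
        for start in range(end):
            piece = word[start:end]
            if best[start] is not None and piece in vocab:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    # Fall back to the unsplit hashtag when no segmentation exists
    return best[-1] if best[-1] is not None else [word]


toy_vocab = {"what", "a", "day", "at", "had"}  # hypothetical word list
print(normalize_tweet("Shout out @TeenVogue https://t.co/uz6lMSQuPE"))  # Shout out <USER> <URL>
print(segment_hashtag("#whataday", toy_vocab))  # ['what', 'a', 'day']
```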

## 1- Text Preprocessing <a class="anchor" id="text_preproc"></a>

- For a comprehensive overview of text preprocessing for social media text, see the following tutorial:
    - [Preprocessing Social Media Text](https://bit.ly/nlpcss201-preproc) by [Steve Wilson](https://steverw.com/), presented in the [NLP+CSS 201 tutorial series](https://nlp-css-201-tutorials.github.io/nlp-css-201-tutorials/).

In [1]:
import pandas as pd

# Import data
df = pd.read_csv("https://raw.githubusercontent.com/steve-wilson/nlpcss201-sm-preprocessing/main/celebrity_tweets_no_rts.csv")
df.head(3)

Unnamed: 0,user,user_type,tweet_id,created_at,text,expanded_urls
0,nicolebyer,comedy,1501332398092414978,2022-03-08 23:01:51+00:00,@ashleyn1cole 🥰💜🥰💜,
1,nicolebyer,comedy,1501241039440400385,2022-03-08 16:58:50+00:00,😭😭😭😭😭😭 https://t.co/X7LngGNLlt,https://twitter.com/laurenlapkus/status/150124...
2,nicolebyer,comedy,1500957090990362625,2022-03-07 22:10:31+00:00,Yah watch #GrandCrew and tell ya friends to wa...,https://twitter.com/eclecticjunki3/status/1500...


In [2]:
# Examples from the dataset
row_ids = [51, 2054, 7661, 12525, 29235, 30183, 50312, 52006, 54902]
for row_id in row_ids:
    row = df.iloc[row_id]
    print(f"@{row.user}:")
    print(row.text)
    if type(row.expanded_urls) == str:
        print('Expanded URLs:',row.expanded_urls)
    print('---')

@nicolebyer:
@BisHilarious 💜💜💜🥳🥳🥳
---
@serenawilliams:
I’m somehow watching IronMan 1 again 🤷🏿‍♀️🤩 @Marvel
---
@jvn:
Watching Kaori cry is making me SOB #WinterOlympics
---
@AOC:
Top winter picket line accessories:
📢 Bullhorn
🧤 Gloves
♨️ Handwarmers
🤝 Solidarity

Shout out @TeenVogue for covering these important issues with the depth that they deserve.

Let’s get these produce workers the buck they’re asking for. 💪🏽 https://t.co/uz6lMSQuPE
Expanded URLs: https://twitter.com/TeamstersJC16/status/1352448474416107526
---
@Oprah:
Do u all agree we are each versions of each other? #SuperSoulSunday
---
@usainbolt:
Yessss @Cristiano @ManUtd 🙌🏿
---
@Zendaya:
HAPPY #BeyDay !!!!!!🐝👑
---
@TheRock:
I’m pumped up like I’m getting snaps at D-End tomorrow 🤣💪🏾🏈
Can’t wait for you to see what myself, @NFL &amp; @NBC have cookin’ 😊💥 🎤 
You guys know my NFL dream never came true, so is this is a full circle ⭕️ moment to stand on this field
#Gratitude
#RockAtThe50
#FINALLY
#SBLVI https://t.co/yyIhvBADt6
E

### Removing Punctuation <a class="anchor" id="remove_punc"></a>

Three methods to remove punctuation:
- string.isalnum()
- string.replace()
- Regex

In [3]:
from string import punctuation

In [4]:
for punc in punctuation:
    print(punc, end= " ")

! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~ 

In [5]:
tweet = df.loc[52006, "text"]
print(tweet)

I’m pumped up like I’m getting snaps at D-End tomorrow 🤣💪🏾🏈
Can’t wait for you to see what myself, @NFL &amp; @NBC have cookin’ 😊💥 🎤 
You guys know my NFL dream never came true, so is this is a full circle ⭕️ moment to stand on this field
#Gratitude
#RockAtThe50
#FINALLY
#SBLVI https://t.co/yyIhvBADt6


#### string.isalnum()

In [6]:
special_string="spe@#$ci87al*&"
print(''.join([char for char in special_string if char.isalnum()]))

speci87al


In [7]:
print(''.join([char for char in tweet if char.isalnum()]))

ImpumpeduplikeImgettingsnapsatDEndtomorrowCantwaitforyoutoseewhatmyselfNFLampNBChavecookinYouguysknowmyNFLdreamnevercametruesoisthisisafullcirclemomenttostandonthisfieldGratitudeRockAtThe50FINALLYSBLVIhttpstcoyyIhvBADt6


In [8]:
print(''.join([char for char in tweet if char.isalnum() or char.isspace()]))

Im pumped up like Im getting snaps at DEnd tomorrow 
Cant wait for you to see what myself NFL amp NBC have cookin   
You guys know my NFL dream never came true so is this is a full circle  moment to stand on this field
Gratitude
RockAtThe50
FINALLY
SBLVI httpstcoyyIhvBADt6


#### string.replace()

In [9]:
text = "Hello world!"
print(text.replace("!", ""))

Hello world


In [71]:
tweet_processed = tweet

In [72]:
for punc in list(punctuation):
    tweet_processed = tweet_processed.replace(punc, "")

In [73]:
print(f"Tweet:\n{tweet}\n")
print(f"Tweet Processed:\n{tweet_processed}\n")  # the typographic apostrophe (’) is not in Python's string.punctuation, so it is kept

Tweet:
I’m pumped up like I’m getting snaps at D-End tomorrow 🤣💪🏾🏈
Can’t wait for you to see what myself, @NFL &amp; @NBC have cookin’ 😊💥 🎤 
You guys know my NFL dream never came true, so is this is a full circle ⭕️ moment to stand on this field
#Gratitude
#RockAtThe50
#FINALLY
#SBLVI https://t.co/yyIhvBADt6

Tweet Processed:
I’m pumped up like I’m getting snaps at DEnd tomorrow 🤣💪🏾🏈
Can’t wait for you to see what myself NFL amp NBC have cookin’ 😊💥 🎤 
You guys know my NFL dream never came true so is this is a full circle ⭕️ moment to stand on this field
Gratitude
RockAtThe50
FINALLY
SBLVI httpstcoyyIhvBADt6



In [13]:
tweet_processed = tweet
for punc in list(punctuation) + ["’"]:
    tweet_processed = tweet_processed.replace(punc, "")

In [14]:
print(f"Tweet:\n{tweet}\n")

print(f"Tweet Processed:\n{tweet_processed}\n")

Tweet:
I’m pumped up like I’m getting snaps at D-End tomorrow 🤣💪🏾🏈
Can’t wait for you to see what myself, @NFL &amp; @NBC have cookin’ 😊💥 🎤 
You guys know my NFL dream never came true, so is this is a full circle ⭕️ moment to stand on this field
#Gratitude
#RockAtThe50
#FINALLY
#SBLVI https://t.co/yyIhvBADt6

Tweet Processed:
Im pumped up like Im getting snaps at DEnd tomorrow 🤣💪🏾🏈
Cant wait for you to see what myself NFL amp NBC have cookin 😊💥 🎤 
You guys know my NFL dream never came true so is this is a full circle ⭕️ moment to stand on this field
Gratitude
RockAtThe50
FINALLY
SBLVI httpstcoyyIhvBADt6
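
The loop over `replace()` scans the text once per punctuation character. A one-pass alternative is `str.translate` with a deletion table built by `str.maketrans`. This is a sketch; the helper name `strip_punctuation` is an assumption, and the typographic apostrophe is added by hand as in the cell above.

```python
from string import punctuation

# Third argument of str.maketrans lists the characters to delete
DELETE_TABLE = str.maketrans("", "", punctuation + "’")


def strip_punctuation(text):
    """Delete ASCII punctuation and the typographic apostrophe in a single pass."""
    return text.translate(DELETE_TABLE)


print(strip_punctuation("I’m at D-End, #FINALLY!"))  # Im at DEnd FINALLY
```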



#### Regex (Regular Expression)

Tutorials:
- [Python RegEx](https://www.w3schools.com/python/python_regex.asp) from [W3Schools](https://www.w3schools.com)
- [Regular Expressions: Regexes in Python (Part 1)](https://realpython.com/regex-python/) from [Real Python](https://realpython.com/)
- [Regular Expressions: Regexes in Python (Part 2)](https://realpython.com/regex-python-part-2/) from [Real Python](https://realpython.com/)
- [Python Regular Expression Tutorial](https://www.datacamp.com/tutorial/python-regular-expression-tutorial) from [DataCamp](https://www.datacamp.com/)

In [15]:
import re

In [16]:
pattern = r'[^\w\s]' # remove everything except word characters (letters, digits, underscore) and whitespace; \w already covers 0-9

print(re.sub(pattern, "", tweet))

Im pumped up like Im getting snaps at DEnd tomorrow 
Cant wait for you to see what myself NFL amp NBC have cookin   
You guys know my NFL dream never came true so is this is a full circle  moment to stand on this field
Gratitude
RockAtThe50
FINALLY
SBLVI httpstcoyyIhvBADt6


In [17]:
pattern = r'[^\w\s#@]' # remove everything except word characters, whitespace, # and @

print(re.sub(pattern, "", tweet))

Im pumped up like Im getting snaps at DEnd tomorrow 
Cant wait for you to see what myself @NFL amp @NBC have cookin   
You guys know my NFL dream never came true so is this is a full circle  moment to stand on this field
#Gratitude
#RockAtThe50
#FINALLY
#SBLVI httpstcoyyIhvBADt6
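
The negated character class above also deletes emojis, because they are neither word characters nor whitespace. If emojis should survive, one option is to drop only characters whose Unicode general category starts with `P` (punctuation). This is a sketch with an assumed helper name, `remove_unicode_punctuation`; note that symbols such as `$`, `+` and `~` belong to the `S` categories and are therefore kept.

```python
import unicodedata


def remove_unicode_punctuation(text):
    """Drop characters in any Unicode punctuation category (Pc, Pd, Pe, Pf, Pi, Po, Ps)."""
    return "".join(
        char for char in text
        if not unicodedata.category(char).startswith("P")
    )


print(remove_unicode_punctuation("Can’t wait, @NFL 😊!"))  # Cant wait NFL 😊
```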


### Removing URLs  <a class="anchor" id="remove_url"></a>

In [18]:
# URL pattern for shortened Twitter URLs (raw string, so \. and \S reach the regex engine unchanged)
url_pattern = r"https?://t\.co/\S+"

print(tweet)
print(re.findall(url_pattern, tweet))

I’m pumped up like I’m getting snaps at D-End tomorrow 🤣💪🏾🏈
Can’t wait for you to see what myself, @NFL &amp; @NBC have cookin’ 😊💥 🎤 
You guys know my NFL dream never came true, so is this is a full circle ⭕️ moment to stand on this field
#Gratitude
#RockAtThe50
#FINALLY
#SBLVI https://t.co/yyIhvBADt6
['https://t.co/yyIhvBADt6']


In [19]:
# URL pattern for expanded URLs (source: https://uibakery.io/regex-library/url-regex-python)
url_pattern = "^https?:\\/\\/(?:www\\.)?[-a-zA-Z0-9@:%._\\+~#=]{1,256}\\.[a-zA-Z0-9()]{1,6}\\b(?:[-a-zA-Z0-9()@:%_\\+.~#?&\\/=]*)$"

In [20]:
expanded_url = df.loc[52006, "expanded_urls"]
print(expanded_url)
print(re.findall(url_pattern, expanded_url))

https://twitter.com/NFL/status/1492561005984899072
['https://twitter.com/NFL/status/1492561005984899072']


In [21]:
expanded_url = df.loc[176, "expanded_urls"]
print(expanded_url)
print(re.findall(url_pattern, expanded_url))

https://findtheone.byspotify.com/share/db2a80bd0f2dd362325cb5138cd6b9b555b43481
['https://findtheone.byspotify.com/share/db2a80bd0f2dd362325cb5138cd6b9b555b43481']
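
Once a URL has been matched, the standard library can split it into components, which is often more useful for analysis than the raw string (for example, counting which domains are shared). A minimal sketch, assuming a hypothetical helper `url_domain`:

```python
from urllib.parse import urlparse


def url_domain(url):
    """Return the host part of a URL in lower case, without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


print(url_domain("https://twitter.com/NFL/status/1492561005984899072"))  # twitter.com
```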


### Lower Casing  <a class="anchor" id="lower"></a>

In [22]:
print(tweet)

I’m pumped up like I’m getting snaps at D-End tomorrow 🤣💪🏾🏈
Can’t wait for you to see what myself, @NFL &amp; @NBC have cookin’ 😊💥 🎤 
You guys know my NFL dream never came true, so is this is a full circle ⭕️ moment to stand on this field
#Gratitude
#RockAtThe50
#FINALLY
#SBLVI https://t.co/yyIhvBADt6


In [23]:
print(tweet.lower())

i’m pumped up like i’m getting snaps at d-end tomorrow 🤣💪🏾🏈
can’t wait for you to see what myself, @nfl &amp; @nbc have cookin’ 😊💥 🎤 
you guys know my nfl dream never came true, so is this is a full circle ⭕️ moment to stand on this field
#gratitude
#rockatthe50
#finally
#sblvi https://t.co/yyihvbadt6


- [unicode_tr](https://github.com/emre/unicode_tr): A Python module that makes Unicode strings behave as expected for Turkish characters. It solves the Turkish "İ" problem.

In [24]:
from unicode_tr import unicode_tr

In [25]:
text_true = unicode_tr(u"istanbul")
text_wrong = u"istanbul"

In [77]:
type(text_true)

unicode_tr.unicode_tr

In [78]:
type(text_wrong)

str

In [26]:
# string.upper
print(text_true.upper(), text_wrong.upper())

İSTANBUL ISTANBUL


In [27]:
# string.capitalize
print(text_true.capitalize(), text_wrong.capitalize())

İstanbul Istanbul


In [28]:
# string.lower
text_true  = unicode_tr(u"ÇINAR")
text_false = u"ÇINAR"
print(text_true.lower(), text_false.lower())

çınar çinar


In [29]:
# string.title
text_true  = unicode_tr(u"izmir istanbul")
text_false = u"izmir istanbul"

print(text_true.title(), text_false.title())

İzmir İstanbul Izmir Istanbul
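
If installing `unicode_tr` is not an option, the dotted/dotless "i" pairs can be handled by hand before calling the built-in methods. The helpers `turkish_lower` and `turkish_upper` below are illustrative names, not part of any library, and the sketch covers only these two letter pairs.

```python
def turkish_lower(text):
    """Lower-case text with Turkish rules: I -> ı and İ -> i."""
    # Replace first: Python's default lower() maps "I" to "i" and
    # "İ" to "i" followed by a combining dot
    return text.replace("I", "ı").replace("İ", "i").lower()


def turkish_upper(text):
    """Upper-case text with Turkish rules: i -> İ and ı -> I."""
    return text.replace("i", "İ").replace("ı", "I").upper()


print(turkish_lower("ÇINAR"), turkish_upper("istanbul"))  # çınar İSTANBUL
```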


### Tokenization  <a class="anchor" id="tokenize"></a>

In [30]:
from nltk.tokenize import word_tokenize
from nltk.tokenize import TweetTokenizer

In [31]:
print(tweet)
print(word_tokenize(tweet))

I’m pumped up like I’m getting snaps at D-End tomorrow 🤣💪🏾🏈
Can’t wait for you to see what myself, @NFL &amp; @NBC have cookin’ 😊💥 🎤 
You guys know my NFL dream never came true, so is this is a full circle ⭕️ moment to stand on this field
#Gratitude
#RockAtThe50
#FINALLY
#SBLVI https://t.co/yyIhvBADt6
['I', '’', 'm', 'pumped', 'up', 'like', 'I', '’', 'm', 'getting', 'snaps', 'at', 'D-End', 'tomorrow', '🤣💪🏾🏈', 'Can', '’', 't', 'wait', 'for', 'you', 'to', 'see', 'what', 'myself', ',', '@', 'NFL', '&', 'amp', ';', '@', 'NBC', 'have', 'cookin', '’', '😊💥', '🎤', 'You', 'guys', 'know', 'my', 'NFL', 'dream', 'never', 'came', 'true', ',', 'so', 'is', 'this', 'is', 'a', 'full', 'circle', '⭕️', 'moment', 'to', 'stand', 'on', 'this', 'field', '#', 'Gratitude', '#', 'RockAtThe50', '#', 'FINALLY', '#', 'SBLVI', 'https', ':', '//t.co/yyIhvBADt6']


In [32]:
tt = TweetTokenizer()
print(tweet)
print(tt.tokenize(tweet))

I’m pumped up like I’m getting snaps at D-End tomorrow 🤣💪🏾🏈
Can’t wait for you to see what myself, @NFL &amp; @NBC have cookin’ 😊💥 🎤 
You guys know my NFL dream never came true, so is this is a full circle ⭕️ moment to stand on this field
#Gratitude
#RockAtThe50
#FINALLY
#SBLVI https://t.co/yyIhvBADt6
['I', '’', 'm', 'pumped', 'up', 'like', 'I', '’', 'm', 'getting', 'snaps', 'at', 'D-End', 'tomorrow', '🤣', '💪🏾', '🏈', 'Can', '’', 't', 'wait', 'for', 'you', 'to', 'see', 'what', 'myself', ',', '@NFL', '&', '@NBC', 'have', 'cookin', '’', '😊', '💥', '🎤', 'You', 'guys', 'know', 'my', 'NFL', 'dream', 'never', 'came', 'true', ',', 'so', 'is', 'this', 'is', 'a', 'full', 'circle', '⭕', '️', 'moment', 'to', 'stand', 'on', 'this', 'field', '#Gratitude', '#RockAtThe50', '#FINALLY', '#SBLVI', 'https://t.co/yyIhvBADt6']
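
To see why a tweet-aware tokenizer keeps hashtags, mentions and URLs intact, it can help to write a small one by hand. The regex below is a simplified sketch, not NLTK's implementation: it tries URLs first, then mentions and hashtags, then words, and finally any single non-space character. Unlike the outputs above, it keeps contractions such as "Can’t" as one token, which is a deliberate design choice here.

```python
import re

TOKEN_RE = re.compile(
    r"""
    https?://\S+          # URLs
  | [@#]\w+               # mentions and hashtags
  | \w+(?:[-'’]\w+)*      # words, allowing internal hyphens and apostrophes
  | \S                    # any other single non-space character
    """,
    re.VERBOSE,
)


def simple_tokenize(text):
    """Return the tokens matched by TOKEN_RE, in order of appearance."""
    return TOKEN_RE.findall(text)


print(simple_tokenize("Can’t wait @NFL #SBLVI https://t.co/yyIhvBADt6 😊!"))
```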


### Removing Stop Words  <a class="anchor" id="remove_stop"></a>

In [33]:
from nltk.corpus import stopwords

In [34]:
print("English:")
print(stopwords.words('english'))
print("\nTurkish:")
print(stopwords.words('turkish'))

English:
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 

In [35]:
tweet

'I’m pumped up like I’m getting snaps at D-End tomorrow 🤣💪🏾🏈\nCan’t wait for you to see what myself, @NFL &amp; @NBC have cookin’ 😊💥 🎤 \nYou guys know my NFL dream never came true, so is this is a full circle ⭕️ moment to stand on this field\n#Gratitude\n#RockAtThe50\n#FINALLY\n#SBLVI https://t.co/yyIhvBADt6'

In [36]:
english_stop_words = set(stopwords.words('english'))  # build the set once
tokens = tt.tokenize(tweet)

print("Tokenized Tweet:")
print(tokens)
print("\nTokenized Tweet without Stop Words:")
print([word for word in tokens if word.lower() not in english_stop_words])

Tokenized Tweet:
['I', '’', 'm', 'pumped', 'up', 'like', 'I', '’', 'm', 'getting', 'snaps', 'at', 'D-End', 'tomorrow', '🤣', '💪🏾', '🏈', 'Can', '’', 't', 'wait', 'for', 'you', 'to', 'see', 'what', 'myself', ',', '@NFL', '&', '@NBC', 'have', 'cookin', '’', '😊', '💥', '🎤', 'You', 'guys', 'know', 'my', 'NFL', 'dream', 'never', 'came', 'true', ',', 'so', 'is', 'this', 'is', 'a', 'full', 'circle', '⭕', '️', 'moment', 'to', 'stand', 'on', 'this', 'field', '#Gratitude', '#RockAtThe50', '#FINALLY', '#SBLVI', 'https://t.co/yyIhvBADt6']

Tokenized Tweet without Stop Words:
['’', 'pumped', 'like', '’', 'getting', 'snaps', 'D-End', 'tomorrow', '🤣', '💪🏾', '🏈', '’', 'wait', 'see', ',', '@NFL', '&', '@NBC', 'cookin', '’', '😊', '💥', '🎤', 'guys', 'know', 'NFL', 'dream', 'never', 'came', 'true', ',', 'full', 'circle', '⭕', '️', 'moment', 'stand', 'field', '#Gratitude', '#RockAtThe50', '#FINALLY', '#SBLVI', 'https://t.co/yyIhvBADt6']
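
Stop word lists are not neutral: the English list printed above contains negations such as "no", "nor" and "not", and removing them can flip the meaning of a sentence. The sketch below filters tokens against any stop word collection while exempting selected words. The function name `remove_stop_words`, the `keep` parameter and the tiny stop word set are assumptions for illustration.

```python
def remove_stop_words(tokens, stop_words, keep=()):
    """Drop tokens found in stop_words (case-insensitive), except those listed in keep."""
    active = set(stop_words) - set(keep)
    return [token for token in tokens if token.lower() not in active]


toy_stop_words = {"i", "am", "not", "the", "a", "to"}  # hypothetical short list
tokens = ["I", "am", "not", "happy"]

print(remove_stop_words(tokens, toy_stop_words))                # ['happy']
print(remove_stop_words(tokens, toy_stop_words, keep={"not"}))  # ['not', 'happy']
```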


### Removing Emoji  <a class="anchor" id="remove_emoji"></a>

In [38]:
# In UTF-8, an emoji is encoded as a sequence of bytes
print('😊'.encode("UTF-8"))

b'\xf0\x9f\x98\x8a'


In [37]:
import emoji

In [39]:
print(emoji.is_emoji('😊'))

True


In [40]:
print(emoji.demojize('😊'))

:smiling_face_with_smiling_eyes:


In [41]:
print(emoji.demojize('🏈'))

:american_football:


In [42]:
print("Tokenized tweet:")
print(tt.tokenize(tweet))
print("\nTokenized tweet with emojis demojized:")
print([emoji.demojize(token) if emoji.is_emoji(token) else token for token in tt.tokenize(tweet)])

Tokenized tweet:
['I', '’', 'm', 'pumped', 'up', 'like', 'I', '’', 'm', 'getting', 'snaps', 'at', 'D-End', 'tomorrow', '🤣', '💪🏾', '🏈', 'Can', '’', 't', 'wait', 'for', 'you', 'to', 'see', 'what', 'myself', ',', '@NFL', '&', '@NBC', 'have', 'cookin', '’', '😊', '💥', '🎤', 'You', 'guys', 'know', 'my', 'NFL', 'dream', 'never', 'came', 'true', ',', 'so', 'is', 'this', 'is', 'a', 'full', 'circle', '⭕', '️', 'moment', 'to', 'stand', 'on', 'this', 'field', '#Gratitude', '#RockAtThe50', '#FINALLY', '#SBLVI', 'https://t.co/yyIhvBADt6']

Tokenized tweet with emojis demojized:
['I', '’', 'm', 'pumped', 'up', 'like', 'I', '’', 'm', 'getting', 'snaps', 'at', 'D-End', 'tomorrow', ':rolling_on_the_floor_laughing:', ':flexed_biceps_medium-dark_skin_tone:', ':american_football:', 'Can', '’', 't', 'wait', 'for', 'you', 'to', 'see', 'what', 'myself', ',', '@NFL', '&', '@NBC', 'have', 'cookin', '’', ':smiling_face_with_smiling_eyes:', ':collision:', ':microphone:', 'You', 'guys', 'know', 'my', 'NFL', 'dream'
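
For a rough idea of what removing or describing emojis involves, the standard library's `unicodedata` module can stand in for the `emoji` package. This is a simplified sketch with assumed helper names: it treats every character in the Unicode category `So` ("Symbol, other") as an emoji, so it also catches symbols such as © and leaves skin-tone modifiers and variation selectors untouched.

```python
import unicodedata


def is_symbol(char):
    """True for characters in the Unicode category 'So', which covers most emojis."""
    return unicodedata.category(char) == "So"


def remove_symbols(text):
    """Delete every 'So' character from the text."""
    return "".join(char for char in text if not is_symbol(char))


def describe_symbols(text):
    """Replace every 'So' character with its Unicode name in :snake_case: form."""
    parts = []
    for char in text:
        if is_symbol(char):
            name = unicodedata.name(char, "unknown").lower().replace(" ", "_")
            parts.append(f":{name}:")
        else:
            parts.append(char)
    return "".join(parts)


print(describe_symbols("go 🏈"))  # go :american_football:
```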

### Stemming & Lemmatization  <a class="anchor" id="stem_lemma"></a>

- Both are vocabulary reduction techniques: they map different surface forms of a word to a single form.
- **Stemming**: Strips affixes with heuristic rules. The resulting stem need not be a valid word, and the exact output depends on the stemmer. Examples:
    - "party", "partying", "parties" &#8594; "parti"
    - "programming", "programmer", "programs" &#8594; "program"
- **Lemmatization**: Reduces an inflected word to its dictionary form (lemma), using a vocabulary and, ideally, the word's part of speech. Examples:
    - "runs", "running", "ran" &#8594; "run"
    - "am", "is", "are" &#8594; "be"
- [Stemming and Lemmatization in Python](https://www.datacamp.com/tutorial/stemming-lemmatization-python) by [DataCamp](https://www.datacamp.com/)
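
The difference between the two techniques can be sketched with toy versions: a rule-based suffix stripper and a dictionary lookup. Both `toy_stem` and `toy_lemmatize` are deliberately naive illustrations, not the Porter algorithm or WordNet; the suffix list and the lemma table are made up for this example.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # checked in this order

LEMMAS = {  # hypothetical mini-dictionary of inflected form -> lemma
    "am": "be", "is": "be", "are": "be",
    "ran": "run", "runs": "run", "running": "run",
}


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Look the word up in the lemma table; unknown words are returned lower-cased."""
    return LEMMAS.get(word.lower(), word.lower())


for word in ["runs", "running", "ran", "is"]:
    print(word, toy_stem(word), toy_lemmatize(word))
```

The stemmer turns "running" into the non-word "runn" and cannot relate "ran" to "run", while the lookup handles irregular forms but only for words it knows.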

#### Stemming

In [43]:
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()

In [44]:
print(tt.tokenize(tweet))

['I', '’', 'm', 'pumped', 'up', 'like', 'I', '’', 'm', 'getting', 'snaps', 'at', 'D-End', 'tomorrow', '🤣', '💪🏾', '🏈', 'Can', '’', 't', 'wait', 'for', 'you', 'to', 'see', 'what', 'myself', ',', '@NFL', '&', '@NBC', 'have', 'cookin', '’', '😊', '💥', '🎤', 'You', 'guys', 'know', 'my', 'NFL', 'dream', 'never', 'came', 'true', ',', 'so', 'is', 'this', 'is', 'a', 'full', 'circle', '⭕', '️', 'moment', 'to', 'stand', 'on', 'this', 'field', '#Gratitude', '#RockAtThe50', '#FINALLY', '#SBLVI', 'https://t.co/yyIhvBADt6']


In [45]:
print([stemmer.stem(word) for word in tt.tokenize(tweet)])

['i', '’', 'm', 'pump', 'up', 'like', 'i', '’', 'm', 'get', 'snap', 'at', 'd-end', 'tomorrow', '🤣', '💪🏾', '🏈', 'can', '’', 't', 'wait', 'for', 'you', 'to', 'see', 'what', 'myself', ',', '@nfl', '&', '@nbc', 'have', 'cookin', '’', '😊', '💥', '🎤', 'you', 'guy', 'know', 'my', 'nfl', 'dream', 'never', 'came', 'true', ',', 'so', 'is', 'thi', 'is', 'a', 'full', 'circl', '⭕', '️', 'moment', 'to', 'stand', 'on', 'thi', 'field', '#gratitud', '#rockatthe50', '#final', '#sblvi', 'https://t.co/yyihvbadt6']


#### Lemmatization

In [46]:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet 

lemmatizer = WordNetLemmatizer()

In [47]:
print(tt.tokenize(tweet))

['I', '’', 'm', 'pumped', 'up', 'like', 'I', '’', 'm', 'getting', 'snaps', 'at', 'D-End', 'tomorrow', '🤣', '💪🏾', '🏈', 'Can', '’', 't', 'wait', 'for', 'you', 'to', 'see', 'what', 'myself', ',', '@NFL', '&', '@NBC', 'have', 'cookin', '’', '😊', '💥', '🎤', 'You', 'guys', 'know', 'my', 'NFL', 'dream', 'never', 'came', 'true', ',', 'so', 'is', 'this', 'is', 'a', 'full', 'circle', '⭕', '️', 'moment', 'to', 'stand', 'on', 'this', 'field', '#Gratitude', '#RockAtThe50', '#FINALLY', '#SBLVI', 'https://t.co/yyIhvBADt6']


In [48]:
print([lemmatizer.lemmatize(word) for word in tt.tokenize(tweet)])

['I', '’', 'm', 'pumped', 'up', 'like', 'I', '’', 'm', 'getting', 'snap', 'at', 'D-End', 'tomorrow', '🤣', '💪🏾', '🏈', 'Can', '’', 't', 'wait', 'for', 'you', 'to', 'see', 'what', 'myself', ',', '@NFL', '&', '@NBC', 'have', 'cookin', '’', '😊', '💥', '🎤', 'You', 'guy', 'know', 'my', 'NFL', 'dream', 'never', 'came', 'true', ',', 'so', 'is', 'this', 'is', 'a', 'full', 'circle', '⭕', '️', 'moment', 'to', 'stand', 'on', 'this', 'field', '#Gratitude', '#RockAtThe50', '#FINALLY', '#SBLVI', 'https://t.co/yyIhvBADt6']


In [49]:
print(lemmatizer.lemmatize("is"))
print(lemmatizer.lemmatize("is",wordnet.VERB))

is
be
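
As the last cell shows, the lemmatizer treats every word as a noun unless told otherwise. One common approach is to run a part-of-speech tagger first and translate its tags into WordNet's categories. The sketch below covers only the translation step; it assumes Penn Treebank style tags (as produced, for example, by `nltk.pos_tag`, which is not used in this notebook), and `penn_to_wordnet` is a hypothetical helper name.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank POS tag to the one-letter WordNet POS code."""
    if tag.startswith("J"):
        return "a"  # adjective, same value as wordnet.ADJ
    if tag.startswith("V"):
        return "v"  # verb, same value as wordnet.VERB
    if tag.startswith("R"):
        return "r"  # adverb, same value as wordnet.ADV
    return "n"      # noun, same value as wordnet.NOUN (the lemmatizer's default)


print(penn_to_wordnet("VBZ"), penn_to_wordnet("NNS"), penn_to_wordnet("JJ"))  # v n a
```

The returned code can be passed as the second argument of `lemmatizer.lemmatize`, in the same position as `wordnet.VERB` above.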


### Apply All Preprocessing Steps to Data

In [50]:
STOP_WORDS = set(stopwords.words('english'))  # build once instead of once per token

def preprocess_tweet(text):
    tt = TweetTokenizer()
    # Remove shortened Twitter URLs
    url_pattern = r"https?://t\.co/\S+"
    text = re.sub(url_pattern, "", text)
    # Lower case
    text = text.lower()
    # Tokenization
    text_tokenized = tt.tokenize(text)
    # Remove stop words
    text_tokenized = [token for token in text_tokenized if token not in STOP_WORDS]
    # Demojize
    text_tokenized = [emoji.demojize(token) if emoji.is_emoji(token) else token for token in text_tokenized]
    # Remove punctuation (substring test against string.punctuation, so tokens
    # such as "..." and the typographic apostrophe ’ are kept)
    text_tokenized = [token for token in text_tokenized if token not in punctuation]
    # Stemming (uses the PorterStemmer instance created above)
    text_tokenized = [stemmer.stem(token) for token in text_tokenized]
    return ' '.join(text_tokenized)

In [None]:
#df["preprocessed_text"] = df["text"].apply(lambda text: preprocess_tweet(text))

In [52]:
preprocessed_texts = []
n_rows = df.shape[0]

for i, text in enumerate(df["text"]):
    # Report progress every 100 tweets
    if i % 100 == 0:
        print(f"{i:,}/{n_rows:,} | {i/n_rows*100:.2f}%")
    preprocessed_texts.append(preprocess_tweet(text))
print(f"{n_rows:,}/{n_rows:,} | 100.00%")

0/55,080 | 0.00%
100/55,080 | 0.18%
200/55,080 | 0.36%
300/55,080 | 0.54%
...

In [53]:
df["preprocessed_text"] = preprocessed_texts  # assign by position; a pd.Series would be aligned on the index instead

In [54]:
for i in df.sample(30, random_state=0).index:
    print("- Text:")
    print(df.loc[i, "text"])
    print("\n- Preprocessed Text:")
    print(df.loc[i, "preprocessed_text"])
    print("-"*60)

- Text:
As a man, it’s a whole other level when you hold another grown man with tears in his eyes. 
I’ve held a few. 
A few have held me. 
This man lost his mama and she used to compare us. 
You’re right brother,… https://t.co/Y8p0SsIvRS

- Preprocessed Text:
man ’ whole level hold anoth grown man tear eye ’ held held man lost mama use compar us ’ right brother …
------------------------------------------------------------
- Text:
Of all the pictures @joethomas73 can asks me to sign. This is how I look at my wife... and cheat meals 😂. 
He’s an inspiration, good buddy and good man. 
10,363 (if you know, you know) 
#goat 
@NFL https://t.co/FwLDbsmgMZ

- Preprocessed Text:
pictur @joethomas73 ask sign look wife ... cheat meal :face_with_tears_of_joy: ’ inspir good buddi good man 10,363 know know #goat @nfl
------------------------------------------------------------
- Text:
Introducing @SKIMS Teddy, our first all-weather collection. Drops Thursday, September 9 at 9AM PT. https://t.co/Qsy5

## 2-  Converting Text to Numbers (Text Vectorization) <a class="anchor" id="vectorization"></a>

### Bag-of-Words (BoW) <a class="anchor" id="bow"></a>

![bow](img/bow.png)

Image Source: https://www.analyticsvidhya.com/blog/2020/02/quick-introduction-bag-of-words-bow-tf-idf/
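
To make the representation concrete, here is a minimal sketch that builds a document-term count matrix by hand from whitespace-separated text. The function name `bag_of_words` and the two toy documents are assumptions for illustration; `CountVectorizer` applies its own tokenization and returns a sparse matrix instead of nested lists.

```python
from collections import Counter


def bag_of_words(documents):
    """Return the sorted vocabulary and one row of term counts per document."""
    tokenized = [doc.split() for doc in documents]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Counter returns 0 for terms that do not occur in this document
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix


vocabulary, matrix = bag_of_words(["good buddi good man", "man lost mama"])
print(vocabulary)  # ['buddi', 'good', 'lost', 'mama', 'man']
print(matrix)      # [[1, 2, 0, 0, 1], [0, 0, 1, 1, 1]]
```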

In [55]:
from sklearn.feature_extraction.text import CountVectorizer

In [56]:
count_vectorizer = CountVectorizer(min_df=50) # min_df=50 means "ignore terms that appear in fewer than 50 documents".

In [57]:
BoW = count_vectorizer.fit_transform(df["preprocessed_text"])

In [58]:
pd.DataFrame(BoW.todense(), columns=count_vectorizer.get_feature_names_out(), index=df["preprocessed_text"])

Unnamed: 0_level_0,000,10,100,11,12,12pm,13,14,15,16,...,रह,वन,वर,शन,सभ,सम,सर,सरक,हम,हर
preprocessed_text,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
@ashleyn1col :smiling_face_with_hearts: :purple_heart: :smiling_face_with_hearts: :purple_heart:,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
:loudly_crying_face: :loudly_crying_face: :loudly_crying_face:,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
yah watch #grandcrew tell ya friend watch,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
@nanglish @grandcrewnbc :purple_heart: :purple_heart: :purple_heart:,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
oooooh @ozzymo :purple_heart: :purple_heart: :purple_heart:,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
tax day they'r like eat cooki dough like ok i'll make sacrific art,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
second glanc #reput ... readi,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
#lwymmdvideo,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
offici #lwymmdvideo world premier sunday 8/ 27 @vma,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Term Frequency–Inverse Document Frequency (TF-IDF) <a class="anchor" id="tfidf"></a>

$t = \text{term}$

$d = \text{document}$

$n_{t,d} = \text{number of times } t \text{ occurs in } d$

$\textit{tf}_{t,d} = \frac{n_{t,d}}{\text{number of terms in } d}$
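
The term frequency is then weighted by how rare the term is across the corpus. One common textbook form, with $N$ the number of documents and $\textit{df}_t$ the number of documents that contain $t$ (both symbols introduced here for illustration), is:

$$\textit{idf}_t = \log\frac{N}{\textit{df}_t}, \qquad \textit{tf-idf}_{t,d} = \textit{tf}_{t,d} \times \textit{idf}_t$$

A term that occurs in every document gets $\textit{idf}_t = \log 1 = 0$, so it contributes nothing. Note that scikit-learn's `TfidfVectorizer` uses a variant by default: raw counts for the term frequency, the smoothed $\ln\frac{1+N}{1+\textit{df}_t} + 1$ for the inverse document frequency, and L2 normalization of each document vector. Its values therefore differ from this formula.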

![tfidf](img/tfidf.png)

Image Source: https://www.analyticsvidhya.com/blog/2020/02/quick-introduction-bag-of-words-bow-tf-idf/
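
The textbook weighting can be sketched in a few lines of plain Python. This is an illustration under stated assumptions: it uses the length-normalized term frequency defined above and an unsmoothed logarithmic inverse document frequency, so its numbers will not match `TfidfVectorizer`, which smooths and normalizes differently. The function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter


def tf_idf(tokenized_docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents in which each term appears
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    weights = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = [["good", "man"], ["good", "game"]]
for row in tf_idf(docs):
    print(row)
```

Here "good" appears in both documents and receives a weight of zero, while "man" and "game" each receive 0.5 × ln 2.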

In [59]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [60]:
tfidf_vectorizer = TfidfVectorizer(min_df=50)

In [61]:
tfidf = tfidf_vectorizer.fit_transform(df["preprocessed_text"])

pd.DataFrame(tfidf.todense(), columns=tfidf_vectorizer.get_feature_names_out(), index=df["preprocessed_text"])

Unnamed: 0_level_0,000,10,100,11,12,12pm,13,14,15,16,...,रह,वन,वर,शन,सभ,सम,सर,सरक,हम,हर
preprocessed_text,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
@ashleyn1col :smiling_face_with_hearts: :purple_heart: :smiling_face_with_hearts: :purple_heart:,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
:loudly_crying_face: :loudly_crying_face: :loudly_crying_face:,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
yah watch #grandcrew tell ya friend watch,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
@nanglish @grandcrewnbc :purple_heart: :purple_heart: :purple_heart:,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
oooooh @ozzymo :purple_heart: :purple_heart: :purple_heart:,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
tax day they'r like eat cooki dough like ok i'll make sacrific art,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
second glanc #reput ... readi,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
#lwymmdvideo,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
offici #lwymmdvideo world premier sunday 8/ 27 @vma,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Building a Simple Classifier <a class="anchor" id="classifier"></a>

In [62]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

In [63]:
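def train_predict_score(X, y, vectorizer, classifier):
    # Note: the vectorizer is fitted on the full corpus before the split, so the
    # vocabulary (and IDF weights) also see the test documents. Fitting on the
    # training split only gives a stricter estimate.
    X = vectorizer.fit_transform(X)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=530)
    classifier.fit(X_train, y_train)
    pred = classifier.predict(X_test)
    accuracy = metrics.accuracy_score(y_test, pred)
    precision = metrics.precision_score(y_test, pred, average="weighted")
    f1 = metrics.f1_score(y_test, pred, average="weighted")
    return accuracy, precision, f1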
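
A stricter variant splits the raw texts first and lets a scikit-learn `Pipeline` fit the vectorizer on the training portion only, so no information from the test tweets reaches the vocabulary or the IDF weights. The sketch below is one possible setup, not the evaluation used for the scores reported in this notebook; the function name `split_then_fit` and the `min_df` default are assumptions.

```python
from sklearn import metrics
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline


def split_then_fit(texts, labels, min_df=1):
    """Split first, then fit vectorizer and classifier on the training texts only."""
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.33, random_state=530
    )
    model = make_pipeline(
        TfidfVectorizer(min_df=min_df),
        LogisticRegression(max_iter=1000),
    )
    model.fit(X_train, y_train)    # vocabulary and IDF come from X_train only
    pred = model.predict(X_test)   # test texts are transformed, never fitted
    return metrics.accuracy_score(y_test, pred)
```

With the notebook's data this could be called as `split_then_fit(df["preprocessed_text"], df["user_type"], min_df=50)`; the resulting score may differ slightly from the values printed below.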

In [64]:
vectorizers = ["Bag-of-Words", "TF-IDF"]
classifiers = ["Logistic Regression", "Random Forest", "Support Vectors", "Multinomial Naive Bayes"]

vectorizer_name_to_vectorizer = {"Bag-of-Words":CountVectorizer(min_df=50), "TF-IDF":TfidfVectorizer(min_df=50)}
classifier_name_to_classifier = {"Logistic Regression":LogisticRegression(max_iter=1000), "Random Forest":RandomForestClassifier(),
                                 "Support Vectors": LinearSVC(max_iter=1000), "Multinomial Naive Bayes":MultinomialNB()}

In [65]:
for classifier in classifiers:
    for vectorizer in vectorizers:
        accuracy, precision, f1 = train_predict_score(X=df["preprocessed_text"],
                                                      y=df["user_type"],
                                                      vectorizer=vectorizer_name_to_vectorizer[vectorizer],
                                                      classifier=classifier_name_to_classifier[classifier])
        print(f"Vectorizer: {vectorizer} | Classifier: {classifier}\n")
        print(f"Accuracy Score: {accuracy:.3f}")
        print(f"Precision Score (weighted): {precision:.3f}")
        print(f"F1 Score (weighted): {f1:.3f}")
        print("-"*60)

Vectorizer: Bag-of-Words | Classifier: Logistic Regression

Accuracy Score: 0.636
Precision Score (weighted): 0.654
F1 Score (weighted): 0.638
------------------------------------------------------------
Vectorizer: TF-IDF | Classifier: Logistic Regression

Accuracy Score: 0.641
Precision Score (weighted): 0.651
F1 Score (weighted): 0.643
------------------------------------------------------------
Vectorizer: Bag-of-Words | Classifier: Random Forest

Accuracy Score: 0.607
Precision Score (weighted): 0.622
F1 Score (weighted): 0.609
------------------------------------------------------------
Vectorizer: TF-IDF | Classifier: Random Forest

Accuracy Score: 0.618
Precision Score (weighted): 0.628
F1 Score (weighted): 0.620
------------------------------------------------------------




Vectorizer: Bag-of-Words | Classifier: Support Vectors

Accuracy Score: 0.639
Precision Score (weighted): 0.656
F1 Score (weighted): 0.641
------------------------------------------------------------
Vectorizer: TF-IDF | Classifier: Support Vectors

Accuracy Score: 0.638
Precision Score (weighted): 0.647
F1 Score (weighted): 0.639
------------------------------------------------------------
Vectorizer: Bag-of-Words | Classifier: Multinomial Naive Bayes

Accuracy Score: 0.623
Precision Score (weighted): 0.633
F1 Score (weighted): 0.624
------------------------------------------------------------
Vectorizer: TF-IDF | Classifier: Multinomial Naive Bayes

Accuracy Score: 0.624
Precision Score (weighted): 0.637
F1 Score (weighted): 0.626
------------------------------------------------------------
