<a href="https://colab.research.google.com/github/adel-nouar/ML_with_Rune/blob/main/11%20-%20Project%20-%20Natural%20Language%20Processing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project - Natural Language Processing
- Can you determine who tweeted this?

### Description
- We will analyze a collection of tweets from a single Twitter account
- Can we figure out the person behind the account?

### Step 1: Import libraries

In [6]:
import pandas as pd
from nltk import word_tokenize, ngrams
from collections import Counter

In [9]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

### Step 2: Import data
- Use the pandas [read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) function to read **files/tweets.csv**

In [2]:
data = pd.read_csv('files/tweets.csv')
data.head()

| Unnamed: 0 | date | content |
|---|---|---|
| 0 | 2009-05-04 13:54:25 | Be sure to tune in and watch John Doe on Late ... |
| 1 | 2009-05-04 20:00:10 | John Doe will be appearing on The View tomorro... |
| 2 | 2009-05-08 08:38:08 | John Doe reads Top Ten Financial Tips on Late ... |
| 3 | 2009-05-08 15:40:15 | New Blog Post: Celebrity Apprentice Finale and... |
| 4 | 2009-05-12 09:07:28 | "My persona will never be that of a wallflower... |


### Step 3: Convert the content column to a list
- Call **list()** on the **content** column
    - Alternatively, call [to_list()](https://pandas.pydata.org/docs/reference/api/pandas.Series.to_list.html) on the column

In [3]:
content = list(data['content'])

In [4]:
len(content)

43352

### Step 4: Create a corpus
- Create an empty list called **corpus**
- Iterate over **content**
    - Tokenize each item with **word_tokenize** and extend **corpus** with the lowercased words that contain at least one alphabetic character.
        - HINT: To lowercase, call **lower()** on the word.
        - HINT: To check whether any character is alphabetic, use **any(c.isalpha() for c in word)**

In [10]:
corpus = []
for item in content:
  corpus.extend([word.lower() for word in word_tokenize(item) if any(c.isalpha() for c in word)])
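
To see what the filter in the comprehension keeps, it can be run on a handful of tokens in isolation. This is a minimal sketch: the `tokens` list is hand-written for illustration (it is not taken from the tweet data) and stands in for the output of `word_tokenize`, so the example runs without NLTK.

```python
# Hypothetical tokens, standing in for the output of word_tokenize(item).
tokens = ["Make", "America", "Great", "Again", "!", "2016", "#MAGA", "9pm", "..."]

# Keep a token only if at least one of its characters is alphabetic,
# then lowercase it, as in the corpus-building loop.
kept = [word.lower() for word in tokens if any(c.isalpha() for c in word)]

print(kept)
# ['make', 'america', 'great', 'again', '#maga', '9pm']
```

Tokens made up only of punctuation or digits are dropped, while mixed tokens such as `9pm` survive.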

### Step 5: Check corpus
- Find the length of the corpus
- Look at the first 10 words in the corpus

In [11]:
len(corpus)

850290

In [12]:
corpus[:10]

['be', 'sure', 'to', 'tune', 'in', 'and', 'watch', 'john', 'doe', 'on']

### Step 6: Display the most common 3-grams
- Use **Counter(ngrams(corpus, 3))** and assign it to a variable
- List the 10 most common 3-grams
    - HINT: call **most_common(10)** on the result from **Counter(...)**

In [13]:
ngram = Counter(ngrams(corpus, 3))

In [14]:
ngram.most_common(10)

[(('america', 'great', 'again'), 537),
 (('the', 'united', 'states'), 524),
 (('i', 'will', 'be'), 522),
 (('make', 'america', 'great'), 501),
 (('run', 'for', 'president'), 395),
 (('one', 'of', 'the'), 352),
 (('the', 'fake', 'news'), 344),
 (('the', 'white', 'house'), 288),
 (('all', 'of', 'the'), 280),
 (('thank', 'you', 'to'), 274)]
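
For intuition, counting n-grams amounts to sliding a window of n words over the list and tallying each window. The sketch below reproduces that idea with the standard library only. `sliding_ngrams` is a hypothetical helper written for illustration, not NLTK's implementation, and the sample sentence is made up.

```python
from collections import Counter

def sliding_ngrams(words, n):
    """Return an iterator over every run of n consecutive words (illustrative helper)."""
    # zip stops at the shortest slice, so windows never run past the end.
    return zip(*(words[i:] for i in range(n)))

# Made-up sample, not taken from the tweet data.
sample = "the white house and the white house lawn".split()

counts = Counter(sliding_ngrams(sample, 3))
print(counts.most_common(1))
# [(('the', 'white', 'house'), 2)]
```

A list of 8 words yields 8 - 3 + 1 = 6 windows, two of which are the same 3-gram.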

### Step 7 (Optional): Pretty print
- Iterate over the result with a for-loop
    - HINT: Each iteration yields an **ngram** and its **frequency**

In [17]:
for gram, freq in ngram.most_common(10):
  print(f'Frequency: {freq} -> {gram}')

Frequency: 537 -> ('america', 'great', 'again')
Frequency: 524 -> ('the', 'united', 'states')
Frequency: 522 -> ('i', 'will', 'be')
Frequency: 501 -> ('make', 'america', 'great')
Frequency: 395 -> ('run', 'for', 'president')
Frequency: 352 -> ('one', 'of', 'the')
Frequency: 344 -> ('the', 'fake', 'news')
Frequency: 288 -> ('the', 'white', 'house')
Frequency: 280 -> ('all', 'of', 'the')
Frequency: 274 -> ('thank', 'you', 'to')
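
If the tuple notation is distracting, the words of each n-gram can be joined into a phrase before printing. A small sketch, using the top three counts from the output above as stand-in data so that it runs on its own. The names `top` and `lines` are introduced here for illustration.

```python
from collections import Counter

# Top three 3-grams copied from the output above, used as stand-in data.
top = Counter({
    ('america', 'great', 'again'): 537,
    ('the', 'united', 'states'): 524,
    ('i', 'will', 'be'): 522,
})

lines = []
for gram, freq in top.most_common(3):
    # Right-align the count in 5 columns and join the words with spaces.
    lines.append(f'{freq:>5}  {" ".join(gram)}')

print('\n'.join(lines))
```

In the notebook itself, the same loop body would work directly on `ngram.most_common(10)`.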


### Step 8 (Optional): Try it with 4-grams

In [18]:
ngram = Counter(ngrams(corpus, 4))

In [19]:
ngram.most_common(10)

[(('make', 'america', 'great', 'again'), 489),
 (('the', 'great', 'state', 'of'), 173),
 (('the', 'fake', 'news', 'media'), 165),
 (('art', 'of', 'the', 'deal'), 160),
 (('of', 'the', 'united', 'states'), 141),
 (('the', 'art', 'of', 'the'), 137),
 (('in', 'the', 'history', 'of'), 130),
 (('my', 'complete', 'and', 'total'), 116),
 (('complete', 'and', 'total', 'endorsement'), 116),
 (('i', 'will', 'be', 'interviewed'), 113)]
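
Steps 6 and 8 repeat the same two calls with a different n, so they could be wrapped in a small function and run in a loop. The sketch below assumes nothing beyond the standard library. `top_ngrams` is a hypothetical helper, and the sample word list is made up; in the notebook it would be **corpus**.

```python
from collections import Counter

def top_ngrams(words, n, k=10):
    """Return the k most common n-grams in words (hypothetical helper)."""
    windows = zip(*(words[i:] for i in range(n)))
    return Counter(windows).most_common(k)

# Made-up sample; in the notebook this would be the corpus list.
sample = "make america great again make america great again".split()

for n in (3, 4, 5):
    print(n, top_ngrams(sample, n, k=1))
```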

- Repeat with 5-grams

In [20]:
ngram = Counter(ngrams(corpus, 5))
ngram.most_common(10)

[(('the', 'art', 'of', 'the', 'deal'), 134),
 (('my', 'complete', 'and', 'total', 'endorsement'), 115),
 (('has', 'my', 'complete', 'and', 'total'), 106),
 (('it', 'was', 'my', 'great', 'honor'), 90),
 (('was', 'my', 'great', 'honor', 'to'), 87),
 (('to', 'make', 'america', 'great', 'again'), 82),
 (('i', 'will', 'be', 'interviewed', 'on'), 65),
 (('president', 'of', 'the', 'united', 'states'), 61),
 (('in', 'the', 'great', 'state', 'of'), 56),
 (('in', 'the', 'history', 'of', 'our'), 56)]
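
One caveat about these counts: because Step 4 flattens every tweet into a single list, a window can start at the end of one tweet and finish at the start of the next. A possible variation, sketched here with the standard library only, counts n-grams tweet by tweet instead. The two sample tweets are made up, `str.split` stands in for `word_tokenize`, and `ngrams_per_tweet` is a hypothetical helper.

```python
from collections import Counter

def ngrams_per_tweet(tweets, n):
    """Count n-grams without letting a window cross from one tweet into the next."""
    counts = Counter()
    for tweet in tweets:
        # str.split is a simple stand-in for word_tokenize in this sketch.
        words = [w.lower() for w in tweet.split() if any(c.isalpha() for c in w)]
        counts.update(zip(*(words[i:] for i in range(n))))
    return counts

# Made-up tweets for illustration.
tweets = ["Thank you Ohio", "See you tomorrow"]

# Flattened version, as in Step 4: windows may span two tweets.
flat = [w.lower() for t in tweets for w in t.split()]
crossing = Counter(zip(*(flat[i:] for i in range(3))))
separate = ngrams_per_tweet(tweets, 3)

print(('you', 'ohio', 'see') in crossing)   # True
print(('you', 'ohio', 'see') in separate)   # False
```

With tens of thousands of tweets the boundary-crossing windows are unlikely to reach the top of the ranking, so this mainly matters when inspecting rare n-grams.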