# Project - Natural Language Processing
- Can you determine who tweeted this?

### Description
- We will analyze a collection of tweets from a single Twitter account
- Can we figure out the person behind the account?

### Step 1: Import libraries

In [95]:
import pandas as pd
from nltk import word_tokenize, ngrams
from collections import Counter

### Step 2: Import data
- Use the pandas [read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) function to read **files/tweets.csv**

In [89]:
data = pd.read_csv('files/tweets.csv')
data.head()

| Unnamed: 0 | date | content |
|---|---|---|
| 0 | 2009-05-04 13:54:25 | Be sure to tune in and watch John Doe on Late ... |
| 1 | 2009-05-04 20:00:10 | John Doe will be appearing on The View tomorro... |
| 2 | 2009-05-08 08:38:08 | John Doe reads Top Ten Financial Tips on Late ... |
| 3 | 2009-05-08 15:40:15 | New Blog Post: Celebrity Apprentice Finale and... |
| 4 | 2009-05-12 09:07:28 | "My persona will never be that of a wallflower... |


### Step 3: Convert content to a list of content
- Call **list()** on the column **content**
    - Alternatively, call [to_list()](https://pandas.pydata.org/docs/reference/api/pandas.Series.to_list.html) on the column; both give the same list

In [85]:
content = list(data['content'])

In [79]:
content = data['content'].to_list()

### Step 4: Create a corpus
- Create an empty list called **corpus**
- Iterate over **content**
    - Tokenize each tweet with **word_tokenize** and extend **corpus** with the lowercased tokens that contain at least one alphabetic character.
        - HINT: To lowercase, call **lower()** on the word.
        - HINT: To check whether any character is alphabetic, use **any(c.isalpha() for c in word)**

In [54]:
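corpus = []
for item in content:
    # Split the tweet into tokens (words, punctuation, numbers, ...)
    tokens = word_tokenize(item)
    # Keep only tokens with at least one alphabetic character, lowercased
    corpus.extend(word.lower() for word in tokens if any(c.isalpha() for c in word))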
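
To see what the filter in Step 4 keeps and drops, here is a minimal sketch on a hand-written token list. The tokens are an assumption standing in for the output of **word_tokenize**; the real tokenizer may split some of these differently.

```python
# Hand-written tokens standing in for the output of word_tokenize (assumption).
tokens = ["Be", "sure", "to", "tune", "in", "at", "11:30", "p.m.", "!", "#", "2nd", "2009"]

# Keep a token if at least one of its characters is alphabetic, then lowercase it.
kept = [word.lower() for word in tokens if any(c.isalpha() for c in word)]

print(kept)
```

Pure punctuation and pure numbers are dropped, while mixed tokens such as "p.m." and "2nd" survive because a single alphabetic character is enough.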

### Step 5: Check corpus
- Find the length of the corpus
- Look at the first 10 words in the corpus

In [55]:
len(corpus)

850410

In [94]:
corpus[:10]

['be', 'sure', 'to', 'tune', 'in', 'and', 'watch', 'donald', 'trump', 'on']

### Step 6: Count the 3-grams and display the most common
- Use **Counter(ngrams(corpus, 3))** and assign it to a variable
- List the 10 most common 3-grams
    - HINT: call **most_common(10)** on the result from **Counter(...)**

In [100]:
n_grams = Counter(ngrams(corpus, 3))

n_grams.most_common(10)

[(('america', 'great', 'again'), 537),
 (('the', 'united', 'states'), 529),
 (('i', 'will', 'be'), 522),
 (('make', 'america', 'great'), 501),
 (('run', 'for', 'president'), 397),
 (('one', 'of', 'the'), 354),
 (('the', 'fake', 'news'), 347),
 (('the', 'white', 'house'), 288),
 (('all', 'of', 'the'), 280),
 (('thank', 'you', 'to'), 275)]
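
An n-gram is simply a run of n consecutive tokens. The sketch below rebuilds that sliding window with plain Python rather than NLTK; **sliding_ngrams** is a hypothetical helper, and it should yield the same tuples as **ngrams** when no padding is requested. It also illustrates one consequence of building **corpus** as a single flat list: some n-grams span the boundary between two tweets.

```python
from collections import Counter

def sliding_ngrams(tokens, n):
    """Return an iterator over every run of n consecutive tokens, as tuples."""
    # zip over n shifted views of the list; zip stops at the shortest view.
    return zip(*(tokens[i:] for i in range(n)))

# Two toy "tweets" (assumption), flattened into one list the way corpus is built.
tweets = [["thank", "you", "ohio"], ["fake", "news", "again"]]
flat = [word for tweet in tweets for word in tweet]

flat_counts = Counter(sliding_ngrams(flat, 3))
per_tweet_counts = Counter(g for tweet in tweets for g in sliding_ngrams(tweet, 3))

print(len(flat_counts))       # 3-grams found in the flattened list
print(len(per_tweet_counts))  # 3-grams that stay inside a single tweet
```

In the flattened list, ('ohio', 'fake', 'news') is counted even though no tweet contains it. With a large corpus such boundary n-grams are unlikely to reach the top of the list, but counting per tweet avoids them entirely.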

### Step 7 (Optional): Pretty print
- Iterate over the result of **most_common(10)** with a for-loop
    - HINT: Each iteration gives an **ngram** and its **frequency**

In [101]:
for ngram, freq in n_grams.most_common(10):
    print(f'{freq}: {ngram}')

537: ('america', 'great', 'again')
529: ('the', 'united', 'states')
522: ('i', 'will', 'be')
501: ('make', 'america', 'great')
397: ('run', 'for', 'president')
354: ('one', 'of', 'the')
347: ('the', 'fake', 'news')
288: ('the', 'white', 'house')
280: ('all', 'of', 'the')
275: ('thank', 'you', 'to')
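
If the tuple notation is distracting, one possible variation is to join each n-gram into a phrase and right-align the counts. This is a sketch on a toy **Counter** whose three entries are copied from the output above; in the notebook you would reuse the existing **n_grams** variable instead.

```python
from collections import Counter

# Toy counts standing in for n_grams (values copied from the output above).
n_grams = Counter({
    ("america", "great", "again"): 537,
    ("the", "united", "states"): 529,
    ("i", "will", "be"): 522,
})

# Right-align the count in 4 columns and join the words with spaces.
lines = [f"{freq:>4}  {' '.join(ngram)}" for ngram, freq in n_grams.most_common(3)]
print("\n".join(lines))
```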


### Step 8 (Optional): Try it with 4-grams

In [102]:
n_grams = Counter(ngrams(corpus, 4))

n_grams.most_common(10)

[(('make', 'america', 'great', 'again'), 489),
 (('the', 'great', 'state', 'of'), 173),
 (('the', 'fake', 'news', 'media'), 167),
 (('art', 'of', 'the', 'deal'), 160),
 (('of', 'the', 'united', 'states'), 141),
 (('the', 'art', 'of', 'the'), 137),
 (('in', 'the', 'history', 'of'), 131),
 (('my', 'complete', 'and', 'total'), 116),
 (('complete', 'and', 'total', 'endorsement'), 116),
 (('i', 'will', 'be', 'interviewed'), 113)]
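
Two of the 4-grams above overlap and share the same count (116), which suggests they may be parts of one longer phrase. To compare several values of n in one pass, a small helper can be looped over them. The sketch below uses a hypothetical **top_ngrams** function and a toy token list; in the notebook you would pass **corpus** instead.

```python
from collections import Counter

def top_ngrams(tokens, n, k=10):
    """Return the k most common n-grams in tokens as (tuple, count) pairs."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams).most_common(k)

# Toy token list (assumption); replace with corpus in the notebook.
tokens = ("my complete and total endorsement thank you "
          "my complete and total endorsement").split()

# Print the single most common n-gram for each window size.
for n in (3, 4, 5):
    print(n, top_ngrams(tokens, n, k=1))
```

When a longer n-gram occurs about as often as the shorter ones it contains, the shorter ones are most likely fragments of that fixed phrase.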