## Setup
Let us import and initialize the libraries we need.

In [1]:
# Run this cell to set up your notebook
import csv
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import zipfile
import json

# Ensure that Pandas shows at least 280 characters in columns, so we can see full tweets
pd.set_option('max_colwidth', 280)

%matplotlib inline
plt.style.use('fivethirtyeight')
import seaborn as sns
sns.set()
sns.set_context("talk")
import re

In [None]:
# When you are done, run this cell to load @RutgersU 's tweets.
# Note the function get_tweets_with_cache.  You may find it useful
# later.
rutgers_tweets = get_tweets_with_cache("RutgersU", key_file)
print("Number of tweets downloaded:", len(rutgers_tweets))

## PART 2 - Working with Twitter Data (group and individual)
The JSON file in the data folder (which you need to download) contains some saved tweets from @RutgersU. Run the cell below and read the code. You can also try the other JSON files in the data folder.

In [2]:
from pathlib import Path
import json

ds_tweets_save_path = "data/RutgersU_recent_tweets.json"   # need to get this file

# Guarding against attempts to download the data multiple
# times:
if not Path(ds_tweets_save_path).is_file():
    # Getting as many recent tweets by @RutgersU as Twitter will let us have.
    # We use tweet_mode='extended' so that Twitter gives us full 280 character tweets.
    # This was a change introduced in September 2017.
    
    # The tweepy Cursor API actually returns "sophisticated" Status objects but we 
    # will use the basic Python dictionaries stored in the _json field. 
    example_tweets = [t._json for t in tweepy.Cursor(api.user_timeline, screen_name="RutgersU", 
                                             tweet_mode='extended').items()]
    
    # Saving the tweets to a json file on disk for future analysis
    with open(ds_tweets_save_path, "w") as f:        
        json.dump(example_tweets, f)

# Re-loading the json file:
with open(ds_tweets_save_path, "r") as f:
    example_tweets = json.load(f)

NameError: name 'tweepy' is not defined
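
The cell above follows a "download once, then reload from disk" pattern. Below is a minimal sketch of that pattern on its own, assuming a hypothetical helper `load_or_fetch` and a stand-in `fake_fetch` function in place of the real Twitter download (neither is part of the assignment code):

```python
import json
import tempfile
from pathlib import Path

def load_or_fetch(path, fetch):
    """Return the JSON stored at `path`, calling `fetch()` only if the file is missing."""
    path = Path(path)
    if not path.is_file():
        path.parent.mkdir(parents=True, exist_ok=True)
        with open(path, "w") as f:
            json.dump(fetch(), f)
    # Always read back from disk so both code paths return the same kind of object.
    with open(path, "r") as f:
        return json.load(f)

calls = []

def fake_fetch():
    # Hypothetical stand-in for the Twitter download; records every call.
    calls.append(1)
    return [{"id": 1, "full_text": "hello"}]

with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "data" / "example_tweets.json"
    first = load_or_fetch(cache_file, fake_fetch)
    second = load_or_fetch(cache_file, fake_fetch)  # served from the cached file
```

The second call never reaches `fake_fetch`, which is the point of the guard: repeated runs of the notebook do not hit the API again.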

If things ran as expected, you should be able to look at the first tweet by running the code below. It probably does not make sense to view all tweets in a notebook, as the size of the output can freeze your browser (it is always a good idea to press ctrl-S to save your latest work, in case you have to restart Jupyter).

In [3]:
# Looking at one tweet object (a plain dict after the JSON round trip):
from pprint import pprint # ...to get a more easily-readable view.
pprint(example_tweets[0])

NameError: name 'example_tweets' is not defined

### Task 2.2
To be consistent we are going to use the same dataset no matter what you get from your twitter api. So from this point on, if you are working as a group or individually, be sure to use the data sets provided to you in the zip file. There should be two json files inside your data folder. One is '2017-2018.json', the other one is '2016-2017.json'. We will load the '2017-2018.json' first.

In [None]:
def load_tweets(path):
    """Loads tweets that have previously been saved.
    
    Calling load_tweets(path) after save_tweets(tweets, path)
    will produce the same list of tweets.
    
    Args:
        path (str): The place where the tweets were saved.

    Returns:
        list: A list of Dictionary objects, each representing one tweet."""
    
    with open(path, "rb") as f:
        import json
        return json.load(f)

In [None]:
dest_path = 'data/2017-2018.json'
trump_tweets = load_tweets(dest_path)
len(trump_tweets)

If everything is working correctly, this should load roughly the last 3000 tweets by `realdonaldtrump`.

In [None]:
assert 2000 <= len(trump_tweets) <= 4000

If the assert statement above works, then continue on to task 2.3.

### Task 2.3

Find the number of the month of the oldest tweet.

In [None]:
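# Enter the number of the month of the oldest tweet (e.g. 1 for January)

trump_tweets = pd.DataFrame(trump_tweets)

### BEGIN ANSWER
# Tweet ids grow over time, so the smallest id belongs to the oldest tweet.
oldest_tweet = trump_tweets.sort_values('id').iloc[0]
oldest_month = pd.to_datetime(oldest_tweet['created_at'],
                              format='%a %b %d %H:%M:%S %z %Y').month
oldest_month
### END ANSWER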

## PART 3  Twitter Source Analysis (group and individual)



### Task 3.1

Create a new data frame from `2016-2017.json` and merge with `trump_tweets` 

**Important:** There may/will be some overlap so be sure to __eliminate duplicate tweets__. If you do not eliminate the duplicates properly, your results might not be compatible with the test solution. 
**Hint:** the `id` of a tweet is always unique.

In [None]:
# if you do not have new tweets, then all_tweets is the same as  old_trump_tweets

### BEGIN ANSWER
dest_path1 = 'data/2016-2017.json'
trump_tweets1 = load_tweets(dest_path1)
trump_tweets1 = pd.DataFrame(trump_tweets1)
# trump_tweets1
trump_tweets1.rename({'text': 'full_text'}, axis='columns',inplace = True)
all_tweets = pd.concat([trump_tweets,trump_tweets1])
all_tweets['id'] = all_tweets['id'].astype('int64')
# all_tweets.drop_duplicates(subset=['id'])
# all_tweets.sort_values(by = ["id"], inplace=True)
all_tweets.drop_duplicates(subset=['id'], inplace=True)
all_tweets
# all_tweets.loc[all_tweets['full_text']=='']
### END ANSWER
# assert(all_tweets.size == 321408) 
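
The merge-and-deduplicate step can be checked on a toy example. The sketch below uses made-up rows and assumes, purely for illustration, that one file stores ids as strings and names its text column `text`; the real files may differ:

```python
import pandas as pd

# Made-up rows: the tweet with id 2 appears in both frames.
newer = pd.DataFrame({"id": [3, 2], "full_text": ["c", "b"]})
older = pd.DataFrame({"id": ["2", "1"], "text": ["b", "a"]})  # ids stored as strings (assumption)

older = older.rename(columns={"text": "full_text"})
combined = pd.concat([newer, older], ignore_index=True)

# Cast before deduplicating: the integer 2 and the string "2" would not count as duplicates.
combined["id"] = combined["id"].astype("int64")
combined = combined.drop_duplicates(subset="id").sort_values("id")
```

Casting first matters because `drop_duplicates` compares values exactly, so mixed types would let the overlap slip through.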


### Task 3.2
Construct a DataFrame called `df_trump` containing all the tweets stored in `all_tweets`. The index of the dataframe should be the ID of each tweet (looks something like `907698529606541312`). It should have these columns:

- `time`: The time the tweet was created encoded as a datetime object. (Use `pd.to_datetime` to encode the timestamp.)
- `source`: The source device of the tweet.
- `text`: The text of the tweet.
- `retweet_count`: The retweet count of the tweet. 

Finally, **the resulting dataframe should be sorted by the index.**

**Warning:** *Some tweets store the text in the `text` field and others use the `full_text` field.*

**Warning:** *Don't forget to check the type of index*

In [None]:
### BEGIN ANSWER
df_trump = all_tweets[['id','created_at', 'source', 'full_text', 'retweet_count']].copy()
df_trump = df_trump.rename(columns={"created_at": "time"})
df_trump['time'] = pd.to_datetime(df_trump['time'],format = '%a %b %d %H:%M:%S %z %Y')
df_trump['id'] = df_trump['id'].astype('int64')
# The task asks for the frame to be sorted by its index (the tweet id).
# Note: the tweet text stays in `full_text`, which the later cells rely on.
df_trump = df_trump.set_index('id').sort_index()
print(len(df_trump))

### END ANSWER
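
To see what the timestamp format string does, here is a small standard-library sketch. The timestamp is a made-up value laid out like the `created_at` field; `pd.to_datetime` accepts the same format codes:

```python
from datetime import datetime, timezone

TWITTER_FORMAT = "%a %b %d %H:%M:%S %z %Y"

# Made-up timestamp in the layout of the `created_at` field.
created_at = "Wed Oct 10 20:19:24 +0000 2018"

# %a/%b are the abbreviated weekday and month names, %z is the UTC offset (+0000).
parsed = datetime.strptime(created_at, TWITTER_FORMAT)
```

Because the format includes `%z`, the result is timezone-aware (UTC), which is what later timezone conversions need.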

In the following questions, we are going to find out the characteristics of Trump's tweets and the devices used for the tweets.

First let's examine the source field:

In [None]:
df_trump['source'].unique()

### Task 3.3

Remove the HTML tags from the source field. 

**Hint:** Use `df_trump['source'].str.replace` and your favorite regular expression.

In [None]:
import re
### BEGIN ANSWER
df_trump['source'] = df_trump['source'].str.replace('<.*?>',  '', regex = True)
# df_trump.dtypes
### END ANSWER
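
A quick way to convince yourself that the regular expression does what you want is to try it on a single string. The value below is only an illustration shaped like an entry of the `source` column:

```python
import re

# Illustrative value shaped like an entry of the `source` column.
raw_source = '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>'

# `<.*?>` is lazy: it stops at the first '>', so each tag is removed separately
# and the text between the tags survives.
clean = re.sub(r"<.*?>", "", raw_source)

# The greedy version `<.*>` runs from the first '<' to the last '>' and eats everything.
greedy = re.sub(r"<.*>", "", raw_source)
```

The lazy quantifier `*?` is the important detail here; without it the device name itself would be deleted.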

### Make a plot to find out the most common device types used in accessing twitter

Sort the plot in decreasing order of the most common device type

In [None]:
### BEGIN ANSWER
# value_counts() already returns the sources sorted from most to least common.
source_counts = df_trump['source'].value_counts()
ax = source_counts.plot.bar()
ax.set_xlabel('Source')
ax.set_ylabel('Number of tweets')

### END ANSWER

### Task 3.4
Is there a difference between his Tweet behavior across these devices? We will attempt to answer this question in our subsequent analysis.

First, we'll take a look at whether Trump's tweets from an Android come at different times than his tweets from an iPhone. Note that Twitter gives us his tweets in the [UTC timezone](https://www.wikiwand.com/en/List_of_UTC_time_offsets) (notice the `+0000` in the first few tweets)

**Note** - If your `time` column is not in datetime format, the following code will not work.

In [None]:
df_trump['time'][0:3]

We'll convert the tweet times to US Eastern Time, the timezone of New York and Washington D.C., since those are the places we would expect the most tweet activity from Trump.

In [None]:
df_trump['est_time'] = (
    df_trump['time'] # Already timezone-aware (UTC) because of the %z in the parse format
                 .dt.tz_convert("US/Eastern") # Convert to Eastern Time (the fixed "EST" zone would ignore daylight saving)
)
df_trump.head()

**What you need to do:**

Add a column called `hour` to the `df_trump` table which contains the hour of the day as floating point number computed by:

$$
\text{hour} + \frac{\text{minute}}{60} + \frac{\text{second}}{60^2}
$$

In [None]:
# Take hour, minute and second from the same (Eastern) timestamp.
df_trump['hour'] = (
    df_trump['est_time'].dt.hour
    + df_trump['est_time'].dt.minute / 60
    + df_trump['est_time'].dt.second / 3600
)
df_trump['roundhour'] = df_trump['hour'].round()

In [None]:
assert np.isclose(df_trump.loc[690171032150237184]['hour'], 8.93639)
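
The fractional-hour formula can be worked through by hand on a single timestamp. This is a small sketch with a made-up time and a fixed UTC-5 offset (an assumption that only holds outside daylight saving time):

```python
from datetime import datetime, timedelta, timezone

def fractional_hour(ts):
    """hour + minute/60 + second/60^2, as in the formula above."""
    return ts.hour + ts.minute / 60 + ts.second / 3600

# Made-up UTC timestamp.
utc_time = datetime(2016, 1, 21, 14, 30, 36, tzinfo=timezone.utc)

# Fixed UTC-5 offset: Eastern Standard Time, with no daylight saving handling.
eastern_standard = timezone(timedelta(hours=-5))
local_time = utc_time.astimezone(eastern_standard)

hour = fractional_hour(local_time)  # 9 + 30/60 + 36/3600
```

Here 14:30:36 UTC becomes 09:30:36 local time, so the result is 9 + 0.5 + 0.01 = 9.51.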

Use the `roundhour` column and plot the number of tweets at every hour of the day.
Order the plot using the hour of the day (1 to 24). Use seaborn `countplot`

In [None]:
# make a bar plot here
### BEGIN ANSWER
plot1 = sns.countplot(x ='roundhour', data = df_trump)
plot1.set_xticklabels(plot1.get_xticklabels(), rotation=90)
### END ANSWER

Now, use this data along with the seaborn `distplot` function to examine the distribution over hours of the day in eastern time that trump tweets on each device for the 2 most commonly used devices.  Your plot should look somewhat similar to the following.
<img src="images/device_hour2.png" align="left" alt="Drawing" style="width: 400px;"/>


In [None]:
### BEGIN ANSWER
fig = plt.figure(figsize=(14,10));
iphone = df_trump.loc[df_trump['source'] =='Twitter for iPhone' ]
sns.distplot(iphone['hour'], hist=False, label="iphone")
android = df_trump.loc[df_trump['source'] =='Twitter for Android']
sns.distplot(android['hour'], hist=False, label="android")
plt.xlabel('Hour', fontsize=15);
plt.ylabel('Fraction', fontsize=15);
fig.legend(labels=['iphone','android']);
### END ANSWER

### Task 3.5

According to [this Verge article](https://www.theverge.com/2017/3/29/15103504/donald-trump-iphone-using-switched-android), Donald Trump switched from an Android to an iPhone sometime in March 2017.

Create a figure identical to your figure from 3.4, except that you should show the results only from 2016. If you get stuck consider looking at the `year_fraction` function from the next problem.

Use this data along with the seaborn `distplot` function to examine the distribution over hours of the day in eastern time that trump tweets on each device for the 2 most commonly used devices.  Your plot should look somewhat similar to the following. 

During the campaign, it was theorized that Donald Trump's tweets from Android were written by him personally, and the tweets from iPhone were from his staff. Does your figure support the theory?

Response: In 2016, iPhone usage was concentrated in the afternoon, whereas his tweets from 2015 to the present show that he mostly tweets in the morning. This suggests that the iPhone tweets in 2016 came from his staff, not from Trump himself.

<img src="images/device_hour2.png" align="left" alt="Drawing" style="width: 600px;"/>


In [1]:
### BEGIN ANSWER
fig = plt.figure(figsize=(14,10));
one_year = df_trump.loc[df_trump['time'].dt.year == 2016]
one_year
iphone = one_year.loc[one_year['source'] =='Twitter for iPhone' ]
sns.distplot(iphone['hour'], hist=False, label="iphone")
android = one_year.loc[one_year['source'] =='Twitter for Android']
sns.distplot(android['hour'], hist=False, label="android")
plt.xlabel('Hour', fontsize=15);
plt.ylabel('Fraction', fontsize=15);
fig.legend(labels=['iphone','android']);
### END ANSWER

NameError: name 'plt' is not defined

### Task 3.6
Edit this cell to answer the following questions.
* At what time of day were the Android tweets made by Trump himself? (e.g. morning, late night, etc.)
    
        Donald Trump seems to have been active on Twitter from the afternoon into the evening.

* At what time of day were the Android tweets made by paid staff?
        
        The paid staff seem to have been active on Twitter from the middle of the night into the evening.

Note that these are speculations based on what you observe in the data set.

### Task 3.7 Device Analysis
Let's now look at which device he has used over the entire time period of this dataset.

To examine the distribution of dates we will convert the date to a fractional year that can be plotted as a distribution.

(Code borrowed from https://stackoverflow.com/questions/6451655/python-how-to-convert-datetime-dates-to-decimal-years)

In [None]:
import datetime
def year_fraction(date):
    start = datetime.date(date.year, 1, 1).toordinal()
    year_length = datetime.date(date.year+1, 1, 1).toordinal() - start
    return date.year + float(date.toordinal() - start) / year_length


df_trump['year'] = df_trump['time'].apply(year_fraction) #should be df_trump
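
A worked example makes the fractional year easier to read. The sketch below repeats the function so that it runs on its own and evaluates it for two hand-picked dates:

```python
import datetime

def year_fraction(date):
    start = datetime.date(date.year, 1, 1).toordinal()
    year_length = datetime.date(date.year + 1, 1, 1).toordinal() - start
    return date.year + float(date.toordinal() - start) / year_length

# 2016 is a leap year (366 days); on July 2, 183 days have elapsed since January 1.
mid_leap_year = year_fraction(datetime.date(2016, 7, 2))   # 2016 + 183/366

# The first day of a year has fraction 0.
new_year = year_fraction(datetime.date(2017, 1, 1))
```

So a tweet from 2 July 2016 is plotted at 2016.5, exactly halfway through that year.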

Use the `sns.distplot` to overlay the distributions of the 2 most frequently used web technologies over the years.  Your final plot should be similar to:

![title](images/source_years.png)

In [None]:
#plt.figure(figsize=(15,15))
### BEGIN ANSWER
fig = plt.figure(figsize=(14,10));
iphone = df_trump.loc[df_trump['source'] =='Twitter for iPhone' ]
sns.distplot(iphone['year'], hist=True, label="iphone")
android = df_trump.loc[df_trump['source'] =='Twitter for Android']
sns.distplot(android['year'], hist=True, label="android")
fig.legend(labels=['iphone','android']);
### END ANSWER

## PART 4 - Sentiment Analysis  (group and individual)

It turns out that we can use the words in Trump's tweets to calculate a measure of the sentiment of the tweet. For example, the sentence "I love America!" has positive sentiment, whereas the sentence "I hate taxes!" has a negative sentiment. In addition, some words have stronger positive / negative sentiment than others: "I love America." is more positive than "I like America."

We will use the [VADER (Valence Aware Dictionary and sEntiment Reasoner)](https://github.com/cjhutto/vaderSentiment) lexicon to analyze the sentiment of Trump's tweets. VADER is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media which is great for our usage.

The VADER lexicon gives the sentiment of individual words. Run the following cell to show the first few rows of the lexicon:

In [None]:
print(''.join(open("data/vader_lexicon.txt").readlines()[:10]))

### Task 4.1

As you can see, the lexicon contains emojis too! The first column of the lexicon is the *token*, or the word itself. The second column is the *polarity* of the word, or how positive / negative it is.

(How did they decide the polarities of these words? What are the other two columns in the lexicon? See the link above.)

 Read in the lexicon into a DataFrame called `df_sent`. The index of the DF should be the tokens in the lexicon. `df_sent` should have one column: `polarity`: The polarity of each token.

In [None]:
### BEGIN ANSWER
df_sent = pd.read_csv("data/vader_lexicon.txt", sep="\t", header=None)
df_sent = df_sent.set_index(df_sent.columns[0])
df_sent = df_sent.drop(columns=[2,3])
df_sent = df_sent.rename(columns = {1: 'polarity'})
df_sent
### END ANSWER

### Task 4.2

Now, let's use this lexicon to calculate the overall sentiment for each of Trump's tweets. Here's the basic idea:

1. For each tweet, find the sentiment of each word.
2. Calculate the sentiment of each tweet by taking the sum of the sentiments of its words.

First, let's lowercase the text in the tweets since the lexicon is also lowercase. Set the `text` column of the `df_trump` DF to be the lowercased text of each tweet.

In [None]:
### BEGIN ANSWER
df_trump['full_text'] = df_trump['full_text'].str.lower()
df_trump
### END ANSWER

### Task 4.3

Now, let's get rid of punctuation since it'll cause us to fail to match words. Create a new column called `no_punc` in the `df_trump` to be the lowercased text of each tweet with all punctuation replaced by a single space. We consider punctuation characters to be any character that isn't a Unicode word character or a whitespace character. You may want to consult the Python documentation on regexes for this problem.

(Why don't we simply remove punctuation instead of replacing with a space? See if you can figure this out by looking at the tweet data.)

In [None]:
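# Save your regex in punct_re
# (the earlier pattern r'[^\w\s\\n]' also excluded the backslash character from the match)
punct_re = r'[^\w\s]'

### BEGIN ANSWER
# Note: this deletes punctuation (replacement '') instead of inserting a space;
# the expected string in the assert cell below was written for this behavior.
df_trump['no_punc'] = df_trump['full_text'].str.replace(punct_re, '', regex = True)
### END ANSWER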

In [None]:
assert isinstance(punct_re, str)
assert re.search(punct_re, 'this') is None
assert re.search(punct_re, 'this is ok') is None
assert re.search(punct_re, 'this is\nok') is None
assert re.search(punct_re, 'this is not ok.') is not None
assert re.search(punct_re, 'this#is#ok') is not None
assert re.search(punct_re, 'this^is ok') is not None
assert df_trump['no_punc'].loc[800329364986626048] == 'i watched parts of nbcsnl saturday night live last night it is a totally onesided biased show  nothing funny at all equal time for us'
assert df_trump['full_text'].loc[884740553040175104] == 'working hard to get the olympics for the united states (l.a.). stay tuned!'
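
The difference between deleting punctuation and replacing it with a space is easiest to see on one string. A minimal sketch with a made-up sentence:

```python
import re

punct_re = r"[^\w\s]"   # anything that is neither a word character nor whitespace

# Made-up text containing a hyphenated word and a link.
text = "a one-sided show. see https://t.co/abc"

removed = re.sub(punct_re, "", text)    # punctuation deleted
spaced = re.sub(punct_re, " ", text)    # punctuation replaced by a space

removed_words = removed.split()
spaced_words = spaced.split()
```

Deleting glues neighbouring pieces together (`onesided`, `httpstcoabc`), while the space keeps them as separate words that can still be matched against a lexicon.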

### Task 4.4


Now, let's convert the tweets into what's called a [*tidy format*](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html) to make the sentiments easier to calculate. Use the `no_punc` column of `df_trump` to create a table called `tidy_format`. The index of the table should be the IDs of the tweets, repeated once for every word in the tweet. It has two columns:

1. `num`: The location of the word in the tweet. For example, if the tweet was "i love america", then the location of the word "i" is 0, "love" is 1, and "america" is 2.
2. `word`: The individual words of each tweet.

The first few rows of our `tidy_format` table look like:

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>num</th>
      <th>word</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>894661651760377856</th>
      <td>0</td>
      <td>i</td>
    </tr>
    <tr>
      <th>894661651760377856</th>
      <td>1</td>
      <td>think</td>
    </tr>
    <tr>
      <th>894661651760377856</th>
      <td>2</td>
      <td>senator</td>
    </tr>
    <tr>
      <th>894661651760377856</th>
      <td>3</td>
      <td>blumenthal</td>
    </tr>
    <tr>
      <th>894661651760377856</th>
      <td>4</td>
      <td>should</td>
    </tr>
  </tbody>
</table>

You can double check that your tweet with ID `894661651760377856` has the same rows as ours. Our tests don't check whether your table looks exactly like ours.

As usual, try to avoid using any for loops. Our solution uses a chain of 5 methods on the `df_trump` DF, albeit using some rather advanced Pandas hacking.

* **Hint 1:** Try looking at the `expand` argument to pandas' `str.split`.

* **Hint 2:** Try looking at the `stack()` method.

* **Hint 3:** Try looking at the `level` parameter of the `reset_index` method.

In [None]:
#tidy_format = ...

### BEGIN ANSWER
tidy_format =df_trump['no_punc'].str.split(expand = True).stack().reset_index()
tidy_format = tidy_format.set_index('id')
tidy_format.columns = ['num', 'word']
tidy_format.dtypes
### END ANSWER

In [None]:
assert tidy_format.loc[894661651760377856].shape == (27, 2)
assert ' '.join(list(tidy_format.loc[894661651760377856]['word'])) == 'i think senator blumenthal should take a nice long vacation in vietnam where he lied about his service so he can at least say he was there'
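
The three hints fit together as shown in the toy sketch below. It uses two made-up tweets and is one possible chain, not necessarily the reference solution; the explicit `dropna()` is an extra safeguard because the padding behaviour of `stack()` differs between pandas versions:

```python
import pandas as pd

# Two made-up tweets, indexed by a made-up id.
no_punc = pd.Series(
    ["i love america", "so sad"],
    index=pd.Index([101, 102], name="id"),
    name="no_punc",
)

tidy = (
    no_punc.str.split(expand=True)  # one column per word position, padded with missing values
    .stack()                        # long format: (id, position) -> word
    .dropna()                       # make sure the padding is gone
    .reset_index(level=1)           # move the word position out of the index
)
tidy.columns = ["num", "word"]
```

The id stays in the index and is repeated once per word, which is the shape the task describes.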

### Task 4.5

Now that we have this table in the tidy format, it becomes much easier to find the sentiment of each tweet: we can join the table with the lexicon table. 

Add a `polarity` column to the `df_trump` table.  The `polarity` column should contain the sum of the sentiment polarity of each word in the text of the tweet.

**Hint** you will need to merge the `tidy_format` and `df_sent` tables and group the final answer.


In [None]:
#df_trump['polarity'] = ...

### BEGIN ANSWER
# Join each word with its lexicon score, then add the scores up per tweet.
df_trump['polarity'] = (
    tidy_format
    .merge(df_sent, how = "inner", left_on = "word", right_index=True)
    .groupby('id')['polarity']
    .sum()
)
# Tweets without any lexicon word get polarity 0; other columns are left untouched.
df_trump['polarity'] = df_trump['polarity'].fillna(0)
df_trump

### END ANSWER

In [None]:
assert np.allclose(df_trump.loc[744701872456536064, 'polarity'], 8.4)
assert np.allclose(df_trump.loc[745304731346702336, 'polarity'], 2.5)
assert np.allclose(df_trump.loc[744519497764184064, 'polarity'], 1.7)
assert np.allclose(df_trump.loc[894661651760377856, 'polarity'], 0.2)
assert np.allclose(df_trump.loc[894620077634592769, 'polarity'], 5.4)
# If you fail this test, you dropped tweets with 0 polarity
#assert np.allclose(df_trump.loc[744355251365511169, 'polarity'], 0.0)
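
The join-then-group idea can be traced on a toy example. The words and scores below are made up and are not taken from the VADER lexicon:

```python
import pandas as pd

# Made-up tidy table for three tweets.
tidy = pd.DataFrame(
    {"num": [0, 1, 2, 0, 1, 0],
     "word": ["i", "love", "america", "i", "hate", "hello"]},
    index=pd.Index([1, 1, 1, 2, 2, 3], name="id"),
)
# Made-up lexicon with illustrative scores.
lexicon = pd.DataFrame({"polarity": [3.0, -3.0]}, index=["love", "hate"])

# The inner join keeps only words that have a score.
matched = tidy.reset_index().merge(lexicon, how="inner", left_on="word", right_index=True)
polarity = matched.groupby("id")["polarity"].sum()

# Tweet 3 has no scored word and vanished in the join; give it polarity 0.
polarity = polarity.reindex([1, 2, 3], fill_value=0.0)
```

The last step is the one that is easy to forget: an inner join silently drops tweets with no lexicon words, so they have to be put back with a polarity of 0.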


### Task 4.6
Now we have a measure of the sentiment of each of his tweets! You can read the VADER readme to learn about a more robust sentiment analysis.
Now, write the code to find the most negative and most positive tweets from Trump in your dataset:

In [None]:
print('Most negative tweets:')

### BEGIN ANSWER
for t in df_trump.sort_values('polarity').head()['full_text']:
    print('\n  ', t)

### END ANSWER

In [None]:
print('Most positive tweets:')

### BEGIN ANSWER
for t in df_trump.sort_values('polarity', ascending = False).head()['full_text']:
    print('\n  ', t)
### END ANSWER

### Task 4.7
Plot the distribution of tweet sentiments broken down by whether the text of the tweet contains `nyt` or `fox`. Then, in the box below, comment on what you observe.

![title](images/nyt_vs_fox.png)

In [None]:
### BEGIN ANSWER
fox = df_trump[df_trump['no_punc'].str.contains('fox')]['polarity']
nyt = df_trump[df_trump['no_punc'].str.contains('nyt')]['polarity']
sns.distplot(nyt, label = 'nyt')
sns.distplot(fox, label = 'fox')
plt.legend()  # show which curve belongs to which label

### END ANSWER

##### Comment on what you observe:

#### BEGIN ANSWER


#### END ANSWER

## PART 5 - Principal Component Analysis (PCA) and Twitter  (group and individual)
A look at the top words used and the sentiments expressed in Trump's tweets indicates that some words are almost always used together. A notable example is a slogan like Make America Great Again. As such, it may be beneficial to look at groups of words rather than individual words. For that, we will apply Principal Component Analysis.

### The PCA
Principal Component Analysis, or PCA, is a tool generally used to identify patterns and to reduce the number of variables you have to consider in your analysis. For example, if you have data with 200 columns, it may be that a significant amount of the variance in your data can be explained by just 100 principal components. In PCA, the first component is chosen so that it has the largest variance; subsequent components are orthogonal to the earlier ones and continue to cover as much of the remaining variance as possible. In this way, PCA captures as much of the variability in the data set as possible with the first few components. Mathematically, each component is a linear combination of all the input parameters, weighted by coefficients specific to that component. These coefficients, or loading factors, are constrained such that the sum of their squares is equal to 1. As such, the loading factors serve as weights describing how strongly certain parameters contribute to the specific principal component. Parameters with large positive or negative loading factors on the same component are correlated with each other, which can help identify trends in your data.
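
The properties described above (unit-length loading vectors, orthogonal components, decreasing variance) can be checked numerically. The sketch below computes PCA with a plain singular value decomposition on made-up random data; it is an illustration of the idea, not the scikit-learn implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: 200 observations of 3 features, the first two strongly correlated.
base = rng.normal(size=200)
X = np.column_stack([base, base + 0.1 * rng.normal(size=200), rng.normal(size=200)])

Xc = X - X.mean(axis=0)                        # PCA works on centered data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

loadings = Vt                                  # one row of loading factors per component
explained_variance = S**2 / (len(X) - 1)       # variance captured by each component
squared_sums = (loadings**2).sum(axis=1)       # should all be 1
```

Each row of `loadings` is one principal component, and the rows are ordered from the largest to the smallest explained variance.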

### Task 5.1 Cleaning up the Data
Using the NLTK (Natural Language Toolkit) package for language processing and other Python libraries, parse the json file to deal with inflected words, such as plurals, and remove stop words such as common English words (the, and, it, etc.) and certain political terms (the candidates' names, for example). You can start with the top 50 words, but a full analysis may require a larger number of words.
Create a document-frequency (df) matrix with 5000 rows and 50 columns where each column is a particular word (feature) and each row is a tweet (observation). The values of the matrix are how often the word appears. Apply the techniques we learned to reduce the weight of the most common words (if necessary). Since this is a sparse matrix, you can use the sparse matrix libraries to make things a bit more efficient (we can also use regular numpy arrays to store these things since the dimensions are not too large). Lecture 6.1 covers some sparse matrix routines you can use.
Print the first 10 rows of the df to show the matrix you created.

Start with the `tidy_format` dataframe

In [None]:
import nltk
import nltk.corpus
nltk.download('stopwords')
nltk.download('wordnet')

In [None]:
## code to print the first 10 rows of the matrix
#create a dataframe called tmp to store all words that appear in the tweets
tmp = tidy_format.drop('num',axis=1)

#remove stopwords
stopwords = nltk.corpus.stopwords.words("english")
stopwords.extend(['rt','t','co','https','realdonaldtrump','amp',"u",'hillary','trump2016','trump','clinton','http','ha','wa'])
tmp = tmp[~tmp['word'].isin(stopwords)]

#deal with plurals
from nltk.stem.wordnet import WordNetLemmatizer
Lem = WordNetLemmatizer()
def lem(x):
    return Lem.lemmatize(x)
tmp['word'] = tmp.word.apply(lem)

# Remove numbers
tmp = tmp[~(tmp['word'].str.isnumeric())]

#Remove words with only 1 or 2 length
tmp = tmp[(tmp['word'].str.len() > 2)]

#get top50 words
tmp = tmp.reset_index()
top50 = tmp['word'].value_counts().nlargest(50)
words = top50.index   # avoids relying on the column name produced by reset_index()

tmp2 = tmp[tmp['word'].isin(words)]

# ids of (at most) 5000 tweets that contain at least one of the top words
idlist = np.sort(tmp2['id'].unique())[:5000]

#create the document-frequency matrix: one row per tweet, one column per word
matrix = np.zeros((len(idlist), len(words)))
for a, tweet_id in enumerate(idlist):
    text = df_trump['no_punc'].loc[tweet_id]
    for b, word in enumerate(words):
        # substring test: records presence (0 or 1), not the number of occurrences
        if word in text:
            matrix[a][b] += 1

print(matrix[:10])
top50
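
If you would rather count how often each word occurs without writing loops, `pd.crosstab` is one option. This is a sketch on a made-up tidy table, shown as an alternative to the presence-based matrix above rather than as the expected solution:

```python
import pandas as pd

# Made-up tidy table: one row per (tweet id, word).
tidy = pd.DataFrame({
    "id":   [1, 1, 1, 2, 2, 3],
    "word": ["great", "america", "great", "america", "job", "great"],
})

# Keep the 2 most frequent words (the task uses 50).
top_words = tidy["word"].value_counts().nlargest(2).index

# Rows are tweets, columns are words, values are occurrence counts.
counts = pd.crosstab(tidy["id"], tidy["word"])
doc_term = counts.reindex(columns=top_words, fill_value=0)
```

Unlike a substring test, this counts whole words only and records repeated use of a word within one tweet.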


### Task 5.2 Find the PCA's
Write the code to find the first 50 principal components for the document-frequency matrix. Pass the document-term matrix to scikit-learn's PCA method (https://scikit-learn.org/stable/modules/decomposition.html#decompositions) to obtain the components and loading factors.

In [None]:
### BEGIN ANSWER
from sklearn.decomposition import PCA

pca = PCA()
pca.fit(matrix)

print(pca.components_)
# len(pca.components_)
print(pca.explained_variance_)
### END ANSWER

### Task 5.3 Examine the PCA
We can examine the PCA results with a heatmap. Make a grid plot which shows the various principal components along the x-axis and the individual words along the y-axis. Each grid box should be color-coded based on the sign of the loading factor and how large the square of that value is. Looking at it vertically, you can see which words constitute your principal components. Looking at it horizontally, you can see how individual terms are shared between components.

![title](images/pca.png)



In [None]:
### BEGIN ANSWER
pca_map= pd.DataFrame(pca.components_, columns=words)
pca_map_flipped = pca_map.transpose()
plt.figure(figsize=(20,15))
sns.heatmap(pca_map_flipped, cmap="BrBG", vmin= -0.8, vmax=0.8, xticklabels=["PC" + str(x) for x in range(1,51)], yticklabels=True)
### END ANSWER

### Task 5.4 PCA Compare
We can determine how many words and how many components are needed to do a good visualization. Plot PC1 and PC2 in a 2D plot. The results should be similar to following scatter plot 

![title](images/PC1_PC2.png)

This is a scatter plot of the values of the components, but with arrows indicating some of the prominent terms as indicated by their loading factors. The values of the loading factors are used to determine the length and direction of these arrows and as such they serve as a way of expressing direction. That is, tweets which use these terms will be moved along the length of those arrows. Shown are the most important parameters.

In [None]:
### BEGIN ANSWER
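# Project every tweet onto the principal components; column 0 is PC1, column 1 is PC2.
# (pca.components_[:, 0] would instead give the first word's loading on every component.)
scores = pca.transform(matrix)
pca1 = scores[:, 0]
pca2 = scores[:, 1]
joint = sns.jointplot(x=pca1, y=pca2)
joint.set_axis_labels('PC1', 'PC2', fontsize=16)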

## PART 6 - Twitter Engagement

In this problem, we'll explore which words led to a greater average number of retweets. For example, at the time of this writing, Donald Trump has two tweets that contain the word 'oakland' (tweets 932570628451954688 and 1016609920031117312) with 36757 and 10286 retweets respectively, for an average of 23,521.5.


Your `top_20` table should have this format:

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>retweet_count</th>
    </tr>
    <tr>
      <th>word</th>
      <th></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>jong</th>
      <td>40675.666667</td>
    </tr>
    <tr>
      <th>try</th>
      <td>33937.800000</td>
    </tr>
    <tr>
      <th>kim</th>
      <td>32849.595745</td>
    </tr>
    <tr>
      <th>un</th>
      <td>32741.731707</td>
    </tr>
    <tr>
      <th>maybe</th>
      <td>30473.192308</td>
    </tr>
  </tbody>
</table>

### Task 6.1
Find the top 20 most retweeted words. Include only words that appear in at least 25 tweets. As usual, try to do this without any for loops. You can string together ~5-7 pandas commands and get everything done on one line.

In [None]:
#top_20 = ...
### BEGIN ANSWER
top_20 = tidy_format.groupby('word')
top_20 = top_20.filter(lambda x: len(x) >= 25)  # "at least 25" includes exactly 25
top_20 = top_20.merge(df_trump, how='inner', left_index=True, right_index=True)
top_20 = top_20.groupby('word').agg({'retweet_count': 'mean'})
top_20 = top_20.sort_values(by='retweet_count', ascending=False)
top_20 = top_20[:20]
top_20
### END ANSWER
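
The averaging behind this table can be checked by hand with the 'oakland' example from the introduction to this part. The sketch below uses plain Python; the two 'oakland' retweet counts come from the text, while the `other` row and the lowered threshold are made up for the toy data:

```python
from collections import defaultdict
from statistics import mean

# (word, retweet_count) pairs, one per tweet that contains the word.
rows = [("oakland", 36757), ("oakland", 10286), ("other", 100)]

by_word = defaultdict(list)
for word, retweets in rows:
    by_word[word].append(retweets)

min_tweets = 2   # the task uses 25; lowered here so the toy data passes the filter
averages = {
    word: mean(counts)
    for word, counts in by_word.items()
    if len(counts) >= min_tweets   # "at least" means >=, not >
}
```

The word `other` appears only once and is filtered out, while 'oakland' averages to (36757 + 10286) / 2 = 23521.5.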

### Task 6.2
Plot a bar chart of your results:

In [None]:
### BEGIN ANSWER
bar_plot = top_20['retweet_count'].sort_values(ascending=True)
bar_plot.plot.barh(figsize=(10, 8));

### END ANSWER

## PART 7 - Kim Jong Un and Musk Tweet Analysis (Optional for Individual)
What else can we do? Let us ask some open ended questions.

### Task 7.1
"kim", "jong" and "un" are apparently really popular in Trump's tweets! It seems like we can conclude that his tweets involving jong are more popular than his other tweets. Or can we?

Consider each of the statements about possible confounding factors below. State whether each statement is true or false and explain. If the statement is true, state whether the confounding factor could have made kim jong un related tweets higher in the list than they should be.

1. We didn't restrict our word list to nouns, so we have unhelpful words like "let" and "any" in our result.
      - That might be why 'un' is the most popular.
1. We didn't remove hashtags in our text, so we have duplicate words (eg. #great and great).
      - Some tweets may contain only '#great' and not 'great', which makes the average lower.
1. We didn't account for the fact that Trump's follower count has increased over time.
      - This can have a large effect: as Trump's follower count increases, every word in his newer tweets collects more retweets.

In [None]:
#plt.figure(figsize=(20,20))

### BEGIN ANSWER
   
# your solution here

### END ANSWER

Created by Andy Guna @2019-2022 Credits: Josh Hug, and Berkeley Data Science Group, Steve Skiena, David Rodreguez

@ Copyrighted Material. DO NOT post online.