## Set up
Let us initialize all the libraries we need.

In [None]:
# Run this cell to set up your notebook
import csv
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import zipfile
import json

# Ensure that Pandas shows at least 280 characters in columns, so we can see full tweets
pd.set_option('max_colwidth', 280)

%matplotlib inline
plt.style.use('fivethirtyeight')
import seaborn as sns
sns.set()
sns.set_context("talk")
import re

## Downloading Recent Tweets
It is important to download the most recent tweets (especially if you are working as a group). Those working by themselves may use the downloaded files without setting up access to the Twitter API (which can sometimes be a bit complicated). Tweepy (http://www.tweepy.org/) is a Python library that makes it easy to access publicly available Twitter content. We will also provide example code as needed.

In [None]:
pip install tweepy

In [None]:
## Make sure you have set up tweepy if you are working locally.
# https://www.pythoncentral.io/introduction-to-tweepy-twitter-for-python/
# After set up, the following should run:
import tweepy

## PART 1:  Accessing Twitter API  (optional for individuals)
To access the Twitter API, you need to get keys by signing up as a Twitter developer. We will walk you through this process. 
* If you are working by yourself on this project, PART 1 is optional: you can skip it and complete the project using the data files provided in the data folder. However, we highly recommend that you do PART 1 (after completing the project with offline data) if you would like to learn how to use the Twitter API.

### Task 1.1

Follow the instructions below to get your Twitter API keys.  **Read the instructions completely before starting.**

1. [Create a Twitter account](https://twitter.com/).  You can use an existing account if you have one; if you prefer to not do this assignment under your regular account, feel free to create a throw-away account.
2. Under account settings, add your phone number to the account.
3. [Create a Twitter developer account](https://developer.twitter.com/en/apply/) by clicking the 'Apply' button on the top right of the page. Attach it to your Twitter account. You'll have to fill out a form describing what you want to do with the developer account. Explain that you are doing this for a class at Rutgers University and that you don't know exactly what you're building yet and just need the account to get started. These applications are approved by some sort of AI system, so it doesn't matter exactly what you write. Just don't enter a bunch of alweiofalwiuhflawiuehflawuihflaiwhfe type stuff or you might get rejected.
4. Once you're logged into your developer account, [create an application for this assignment](https://apps.twitter.com/app/new).  You can call it whatever you want, and you can write any URL when it asks for a web site.  You don't need to provide a callback URL.
5. On the page for that application, find your Consumer Key and Consumer Secret.
6. On the same page, create an Access Token.  Record the resulting Access Token and Access Token Secret.
7. Edit the file [keys.json](keys.json) and replace the placeholders with your keys.

## WARNING (Please Read) !!!!


### Protect your Twitter Keys
<span style="color:red">
If someone has your authentication keys, they can access your Twitter account and post as you!  So don't give them to anyone, and **don't write them down in this notebook**. 
</span>
The usual way to store sensitive information like this is to put it in a separate file and read it programmatically.  That way, you can share the rest of your code without sharing your keys.  That's why we're asking you to put your keys in `keys.json` for this assignment.
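
A minimal sketch of that pattern, assuming the four field names used for `keys.json` in this notebook. The helper name `read_keys` and the validation step are illustrative additions, not part of the assignment:

```python
import json

# Field names match the keys.json layout described in this notebook.
REQUIRED_FIELDS = ("consumer_key", "consumer_secret",
                   "access_token", "access_token_secret")

def read_keys(path):
    """Read API keys from a JSON file and check that none are missing.

    `read_keys` is a hypothetical helper used only for illustration.
    """
    with open(path) as f:
        keys = json.load(f)
    missing = [name for name in REQUIRED_FIELDS if not keys.get(name)]
    if missing:
        # Report only the field names, never the secret values themselves.
        raise ValueError("keys file is missing: " + ", ".join(missing))
    return keys
```

Because the code only ever refers to the file path, the notebook can be shared while `keys.json` stays private.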


### Avoid making too many API calls.

<span style="color:red">
Twitter limits developers to a certain rate of requests for data.  If you make too many requests in a short period of time, you'll have to wait awhile (around 15 minutes) before you can make more.  </span> 
So carefully follow the code examples you see and don't rerun cells without thinking.  Instead, always save the data you've collected to a file.  We've provided templates to help you do that.
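
One way to follow this advice is to wrap every download in a small caching helper. The sketch below is only an illustration: `fetch_with_cache` and its `fetch` argument are hypothetical names, and `fetch` stands in for whatever function performs the rate-limited request.

```python
import json
from pathlib import Path

def fetch_with_cache(path, fetch):
    """Return cached JSON data from `path`, calling `fetch()` only if needed.

    Both names are hypothetical; `fetch` represents the API request.
    """
    cache = Path(path)
    if cache.is_file():
        # Cache hit: no API request is made.
        with cache.open() as f:
            return json.load(f)
    data = fetch()
    with cache.open("w") as f:
        json.dump(data, f)
    return data
```

Rerunning a cell that uses such a helper reads the saved file instead of spending another API request.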


### Be careful about which functions you call!

<span style="color:red">
This API can retweet tweets, follow and unfollow people, and modify your Twitter settings.  Be careful which functions you invoke! </span> You could accidentally retweet a tweet because you typed `retweet` instead of `retweet_count`. 


In [None]:
import json
key_file = 'keys.json'
# Loading your keys from keys.json (which you should have filled
# in for Task 1.1):
with open(key_file) as f:
    keys = json.load(f)
# If you print or view the contents of keys, be sure to delete the cell!

### Task 1.2 Testing Twitter Authentication
The following code should run without errors or warnings and display Rutgers University's Twitter username.

In [None]:
import tweepy
from tweepy import TweepyException
import logging

try:
    auth = tweepy.OAuthHandler(keys["consumer_key"], keys["consumer_secret"])
    # The access token is already known, so no authorization URL is requested.
    auth.set_access_token(keys["access_token"], keys["access_token_secret"])
    api = tweepy.API(auth)
    print("Rutgers username is:", api.get_user(screen_name="RutgersU").name)
except TweepyException as e:
    logging.warning("There was a Tweepy error. Double check your API keys and try again.")
    logging.warning(e)

## PART 2 - Working with Twitter
The data folder contains a JSON file (which you need to download) with tweets from @RutgersU. Run the cell below and read the code. You can also try it with the other JSON files in the data folder. 

In [None]:
from pathlib import Path
import json

ds_tweets_save_path = "data/RutgersU_recent_tweets.json"   # need to get this file

# Guarding against attempts to download the data multiple
# times:
if not Path(ds_tweets_save_path).is_file():
    # Getting as many recent tweets by @RutgersU as Twitter will let us have.
    # We use tweet_mode='extended' so that Twitter gives us full 280 character tweets.
    # This was a change introduced in September 2017.
    
    # The tweepy Cursor API actually returns "sophisticated" Status objects but we 
    # will use the basic Python dictionaries stored in the _json field. 
    example_tweets = [t._json for t in tweepy.Cursor(api.user_timeline, screen_name="RutgersU", 
                                             tweet_mode='extended').items()]
    
    # Saving the tweets to a json file on disk for future analysis
    with open(ds_tweets_save_path, "w") as f:        
        json.dump(example_tweets, f)

# Re-loading the json file:
with open(ds_tweets_save_path, "r") as f:
    example_tweets = json.load(f)

If things ran as expected, you should be able to look at the first tweet by running the code below. It probably does not make sense to view all tweets in a notebook, as the sheer size of the output can freeze your browser. (It is always a good idea to press Ctrl-S to save your latest work, in case you have to restart Jupyter.)

In [None]:
# Looking at one tweet object, which is a plain dictionary after loading from JSON:
from pprint import pprint # ...to get a more easily-readable view.
pprint(example_tweets[0])

### Task 2.1 (Optional for Individuals)

### What you need to do. 

Re-factor the above code fragment into reusable snippets below.  You should not need to make major modifications; this is mostly an exercise in understanding the above code block. 

In [None]:
def load_keys(path):
    """Loads your Twitter authentication keys from a file on disk.
    
    Args:
        path (str): The path to your key file.  The file should
          be in JSON format and look like this (but filled in):
            {
                "consumer_key": "<your Consumer Key here>",
                "consumer_secret":  "<your Consumer Secret here>",
                "access_token": "<your Access Token here>",
                "access_token_secret": "<your Access Token Secret here>"
            }
    
    Returns:
        dict: A dictionary mapping key names (like "consumer_key") to
          key values."""
    
    ### BEGIN SOLUTION
    import json
    with open(path) as f:
        # The JSON file already has the shape of the dictionary we want.
        keys = json.load(f)
    return keys

    ### END SOLUTION

In [None]:
import json
import tweepy
import logging
from tweepy import TweepyException
def download_recent_tweets_by_user(user_account_name, keys):
    """Downloads tweets by one Twitter user.

    Args:
        user_account_name (str): The name of the Twitter account
          whose tweets will be downloaded.
        keys (dict): A Python dictionary with Twitter authentication
          keys (strings), like this (but filled in):
            {
                "consumer_key": "<your Consumer Key here>",
                "consumer_secret":  "<your Consumer Secret here>",
                "access_token": "<your Access Token here>",
                "access_token_secret": "<your Access Token Secret here>"
            }

    Returns:
        list: A list of dictionary objects, each representing one tweet."""

    ### BEGIN SOLUTION
    auth = tweepy.OAuthHandler(keys["consumer_key"], keys["consumer_secret"])
    auth.set_access_token(keys["access_token"], keys["access_token_secret"])
    api = tweepy.API(auth)
    # Each Status object exposes the raw dictionary in its _json field.
    tweets = [t._json for t in tweepy.Cursor(api.user_timeline,
                                             screen_name=user_account_name,
                                             tweet_mode='extended').items()]
    return tweets
    ### END SOLUTION

In [None]:
def load_tweets(path):
    """Loads tweets that have previously been saved.
    
    Calling load_tweets(path) after save_tweets(tweets, path)
    will produce the same list of tweets.
    
    Args:
        path (str): The place where the tweets were saved.

    Returns:
        list: A list of Dictionary objects, each representing one tweet."""
    
    ### BEGIN SOLUTION
    with open(path, "r") as f:
        # The file holds one JSON array, which json.load returns as a list.
        tweets = json.load(f)
    return tweets
    
    ### END SOLUTION

In [None]:
def get_tweets_with_cache(user_account_name, keys_path):
    """Get recent tweets from one user, loading from a disk cache if available.
    
    The first time you call this function, it will download tweets by
    a user.  Subsequent calls will not re-download the tweets; instead
    they'll load the tweets from a save file in your local filesystem.
    All this is done using the functions you defined in the previous cell.
    This has benefits and drawbacks that often appear when you cache data:
    
    +: Using this function will prevent extraneous usage of the Twitter API.
    +: You will get your data much faster after the first time it's called.
    -: If you really want to re-download the tweets (say, to get newer ones,
       or because you screwed up something in the previous cell and your
       tweets aren't what you wanted), you'll have to find the save file
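       (which will look like <something>_recent_tweets.json) and delete it.
    
    Args:
        user_account_name (str): The Twitter handle of a user, without the @.
        keys_path (str): The path to a JSON keys file in your filesystem.
    """
    
    ### BEGIN SOLUTION
    from pathlib import Path
    save_path = "{}_recent_tweets.json".format(user_account_name)
    if Path(save_path).is_file():
        # Cache hit: read the saved tweets instead of calling the API again.
        with open(save_path) as f:
            return json.load(f)
    keys = load_keys(keys_path)
    tweets = download_recent_tweets_by_user(user_account_name, keys)
    with open(save_path, "w") as f:
        json.dump(tweets, f)
    return tweets
    ### END SOLUTION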

If everything was implemented correctly you should be able to obtain roughly the last 3000 tweets by @RutgersU. (This may take a few minutes)

In [None]:
# When you are done, run this cell to load @RutgersU 's tweets.
# Note the function get_tweets_with_cache.  You may find it useful
# later.
rutgers_tweets = get_tweets_with_cache("RutgersU", key_file)
print("Number of tweets downloaded:", len(rutgers_tweets))

### Task 2.2
To be consistent, we are going to use the same dataset no matter what you get from your Twitter API. From this point on, whether you are working as a group or individually, be sure to use the data sets provided to you in the zip file. There should be two JSON files inside your data folder: '2017-2018.json' and '2016-2017.json'. We will load '2017-2018.json' first.

In [None]:
def load_tweets(path):
    """Loads tweets that have previously been saved.
    
    Calling load_tweets(path) after save_tweets(tweets, path)
    will produce the same list of tweets.
    
    Args:
        path (str): The place where the tweets were saved.

    Returns:
        list: A list of Dictionary objects, each representing one tweet."""
    
    with open(path, "rb") as f:
        import json
        return json.load(f)

In [None]:
dest_path = 'data/2017-2018.json'
trump_tweets = load_tweets(dest_path)

If everything is working correctly, this should load roughly the last 3000 tweets by `realdonaldtrump`.

In [None]:
assert 2000 <= len(trump_tweets) <= 4000

If the assert statement above works, then continue on to task 2.3.

### Task 2.3

Find the number of the month of the oldest tweet.

In [None]:
# Enter the number of the month of the oldest tweet (e.g. 1 for January)
#oldest_month = 10

trump_tweets = pd.DataFrame(trump_tweets)
### BEGIN SOLUTION
# Parse the creation timestamps and take the month of the earliest one.
created = pd.to_datetime(trump_tweets['created_at'])
oldest_month = created.min().month
# In the provided data the oldest tweet is from Oct 19 2017, so this is 10.
print('Number of the month of the oldest tweet:', oldest_month)
### END SOLUTION

## PART 3  Twitter Source Analysis



### Task 3.1

Create a new data frame from `2016-2017.json` and merge with `trump_tweets` 

**Important:** There may/will be some overlap so be sure to __eliminate duplicate tweets__. If you do not eliminate the duplicates properly, your results might not be compatible with the test solution. 
**Hint:** the `id` of a tweet is always unique.
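
The hint can be illustrated on plain Python lists of tweet dictionaries. This is only a sketch: `merge_unique_tweets` is a hypothetical helper and the tweets are made up. In pandas, `drop_duplicates(subset='id')` on the concatenated DataFrame serves the same purpose.

```python
def merge_unique_tweets(*tweet_lists):
    """Concatenate lists of tweet dicts, keeping the first tweet seen per id."""
    seen = {}
    for tweets in tweet_lists:
        for tweet in tweets:
            # setdefault leaves an existing entry untouched, so the first
            # occurrence of each id wins.
            seen.setdefault(tweet["id"], tweet)
    return list(seen.values())

# Made-up tweets: id 2 appears in both data sets.
old = [{"id": 1, "text": "a"}, {"id": 2, "text": "b"}]
new = [{"id": 2, "text": "b"}, {"id": 3, "text": "c"}]
merged = merge_unique_tweets(old, new)
print([t["id"] for t in merged])  # [1, 2, 3]
```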

In [None]:
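# if you do not have new tweets, then all_tweets is the same as old_trump_tweets

### BEGIN SOLUTION
old_trump_tweets = pd.DataFrame(load_tweets('data/2016-2017.json'))
# Stack both data sets and keep one row per tweet id.
all_tweets = (pd.concat([old_trump_tweets, pd.DataFrame(trump_tweets)],
                        ignore_index=True)
                .drop_duplicates(subset='id')
                .reset_index(drop=True))

all_tweets
                                                      
### END SOLUTION 

#assert(all_tweets.size == ???) 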


### Task 3.2
Construct a DataFrame called `df_trump` containing all the tweets stored in `all_tweets`. The index of the dataframe should be the ID of each tweet (looks something like `907698529606541312`). It should have these columns:

- `time`: The time the tweet was created encoded as a datetime object. (Use `pd.to_datetime` to encode the timestamp.)
- `source`: The source device of the tweet.
- `text`: The text of the tweet.
- `retweet_count`: The retweet count of the tweet. 

Finally, **the resulting dataframe should be sorted by the index.**

**Warning:** *Some tweets will store the text in the `text` field and other will use the `full_text` field.*

**Warning:** *Don't forget to check the type of index*
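
The first warning can be handled with a small fallback. The sketch below works on a single tweet dictionary; `tweet_text` is a hypothetical helper and both tweets are made up.

```python
def tweet_text(tweet):
    """Return the tweet's text, preferring `full_text` when it is present."""
    text = tweet.get("full_text")
    if text is None:
        text = tweet.get("text")
    return text

extended = {"id": 1, "full_text": "a long tweet"}   # made-up extended-mode tweet
classic = {"id": 2, "text": "a short tweet"}        # made-up classic tweet
print(tweet_text(extended))  # a long tweet
print(tweet_text(classic))   # a short tweet
```

For the second warning, note that `df_trump.loc[907698529606541312]` only works if the index holds integers; an index of ID strings would need the ID passed as a string.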

In [None]:
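### BEGIN SOLUTION
df_trump = all_tweets.copy()
# Some tweets store their text in `full_text` rather than `text`.
if 'full_text' in df_trump.columns:
    if 'text' in df_trump.columns:
        df_trump['text'] = df_trump['text'].fillna(df_trump['full_text'])
    else:
        df_trump['text'] = df_trump['full_text']
df_trump['time'] = pd.to_datetime(df_trump['created_at'])
# Index by tweet ID, keep the requested columns, and sort by the index.
df_trump = (df_trump.set_index('id')[['time', 'source', 'text', 'retweet_count']]
                    .sort_index())
### END SOLUTION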

In the following questions, we are going to find out the characteristics of Trump's tweets and the devices used for them.

First let's examine the source field:

In [None]:
df_trump['source'].unique()

### Task 3.3

Remove the HTML tags from the source field. 

**Hint:** Use `df_trump['source'].str.replace` and your favorite regular expression.
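
As a sketch of what such a regular expression does, here it is applied to a single string with Python's `re` module. The `raw` value is an illustrative source string in the anchor-tag shape that the `source` field uses; the pattern is one possible choice, not the only one.

```python
import re

TAG_RE = r"<[^>]*>"  # an opening '<', anything that is not '>', then '>'

raw = '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>'
clean = re.sub(TAG_RE, "", raw)
print(clean)  # Twitter for iPhone
```

With pandas, the same pattern can be passed to `str.replace` together with `regex=True`.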

In [None]:
### BEGIN SOLUTION
import re
df_trump['source']=df_trump['source'].str.replace('<[^>]*>','',regex=True)
df_trump['source']
### END SOLUTION

### Make a plot to find out the most common device types used in accessing twitter

Sort the plot in decreasing order of the most common device type

In [None]:
### BEGIN SOLUTION
yp = df_trump['source'].value_counts()
ax = yp.plot.barh(figsize=(10,12))
ax.set_xlabel('Number of Tweets')  # in a horizontal bar plot the counts are on the x-axis
ax.invert_yaxis()
### END SOLUTION

### Task 3.4
Is there a difference between his Tweet behavior across these devices? We will attempt to answer this question in our subsequent analysis.

First, we'll take a look at whether Trump's tweets from an Android come at different times than his tweets from an iPhone. Note that Twitter gives us his tweets in the [UTC timezone](https://www.wikiwand.com/en/List_of_UTC_time_offsets) (notice the `+0000` in the first few tweets)

**Note** - If your `time` column is not in datetime format, the following code will not work.

In [None]:
df_trump['time'][0:3]

We'll convert the tweet times to US Eastern Time, the timezone of New York and Washington D.C., since those are the places we would expect the most tweet activity from Trump.
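
To see what the conversion does to a single timestamp, here is a sketch using only the standard library. The timestamp is made up, and the fixed UTC-5 offset is an assumption that mirrors the `EST` zone name, which, as far as we know, does not follow daylight saving time.

```python
from datetime import datetime, timedelta, timezone

# Twitter's `created_at` layout: weekday, month, day, time, UTC offset, year.
TWITTER_FORMAT = "%a %b %d %H:%M:%S %z %Y"

# A fixed UTC-5 offset, i.e. Eastern Standard Time with no daylight saving.
EST = timezone(timedelta(hours=-5), "EST")

created_at = "Thu Jan 21 13:56:11 +0000 2016"  # made-up example timestamp
utc_time = datetime.strptime(created_at, TWITTER_FORMAT)
est_time = utc_time.astimezone(EST)
print(est_time.isoformat())  # 2016-01-21T08:56:11-05:00
```

A zone such as `US/Eastern` would additionally shift summer tweets by one hour; the notebook's tests are written against `EST`, so that name is kept here.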

In [None]:
df_trump['est_time'] = (
    df_trump['time'] # Timestamps parsed with a +0000 offset are already in UTC
                 .dt.tz_convert("EST") # Convert to Eastern Time
)
df_trump.head()

**What you need to do:**

Add a column called `hour` to the `df_trump` table which contains the hour of the day as floating point number computed by:

$$
\text{hour} + \frac{\text{minute}}{60} + \frac{\text{second}}{60^2}
$$
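
As a worked instance of the formula, the sketch below evaluates it for one made-up local time, 08:56:11. The helper name `fractional_hour` is hypothetical.

```python
from datetime import datetime

def fractional_hour(t):
    """Hour of day as a float: hour + minute/60 + second/60^2."""
    return t.hour + t.minute / 60 + t.second / 60**2

example = datetime(2016, 1, 21, 8, 56, 11)  # made-up local time
print(round(fractional_hour(example), 5))  # 8.93639
```

Here 56 minutes contribute about 0.93333 and 11 seconds about 0.00306, which together with the 8 full hours gives roughly 8.93639.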

In [None]:
df_trump['hour'] = df_trump.est_time.apply(lambda x: x.hour + x.minute/60 + x.second/3600)
df_trump['roundhour']=round(df_trump['hour'])

In [None]:
assert np.isclose(df_trump.loc[690171032150237184]['hour'], 8.93639)


Use the `roundhour` column to plot the number of tweets at every hour of the day.
Order the plot by hour of the day (0 to 24, since `hour` is rounded to the nearest integer). Use seaborn's `countplot`.

In [None]:
# make a bar plot here
### BEGIN SOLUTION

plt.figure(figsize=(15,15))
ax = sns.countplot(x='roundhour', data=df_trump)
ax.set_title('Number of Tweets by Hour of Day')


### END SOLUTION

Now, use this data along with the seaborn `distplot` function to examine the distribution over hours of the day (in Eastern time) at which Trump tweets on each of the 2 most commonly used devices.  Your plot should look somewhat similar to the following. 
![device_hour2.png](attachment:device_hour2.png)


In [None]:
### BEGIN SOLUTION
### make your plot here
tar = df_trump.loc[df_trump['source'] == 'Twitter for iPhone']
tar1 = df_trump.loc[df_trump['source'] == 'Twitter for Android']
ax = sns.distplot(tar['hour'], hist=False, label='iPhone')
ax = sns.distplot(tar1['hour'], hist=False, label='Android')
ax.set(xlabel='Hour', ylabel='Fraction')
ax.legend()  # show which curve belongs to which device
### END SOLUTION

### Task 3.5

According to [this Verge article](https://www.theverge.com/2017/3/29/15103504/donald-trump-iphone-using-switched-android), Donald Trump switched from an Android to an iPhone sometime in March 2017.

Create a figure identical to your figure from 3.4, except that you should show the results only from 2016. If you get stuck consider looking at the `year_fraction` function from the next problem.

Use this data along with the seaborn `distplot` function to examine the distribution over hours of the day (in Eastern time) at which Trump tweets on each of the 2 most commonly used devices.  Your plot should look somewhat similar to the following. 

During the campaign, it was theorized that Donald Trump's tweets from Android were written by him personally, and the tweets from iPhone were from his staff. Does your figure support the theory?

Response: In 2016, iPhone usage was concentrated in the afternoon, while his tweets from 2015 to the present show that he mostly tweets in the morning. This suggests that the iPhone tweets in 2016 came from his staff, not from him.

![title](images/device_hour2.png)

In [None]:
### BEGIN SOLUTION

tweets_2016 = df_trump[df_trump['time'].dt.year == 2016]
print(len(tweets_2016))  # number of tweets from 2016 (.size would count cells, not rows)
android_2016 = tweets_2016[tweets_2016['source'] == 'Twitter for Android']
iphone_2016 = tweets_2016[tweets_2016['source'] == 'Twitter for iPhone']

ax = sns.distplot(iphone_2016['hour'], hist=False,label='iPhone') ##blue
ax = sns.distplot(android_2016['hour'], hist=False,label='Android') ##red
ax.set(xlabel='Hour',ylabel='Fraction')
ax.legend(loc='upper left', frameon=True)
### END SOLUTION

### Task 3.6
Edit this cell to answer the following questions.
* At what time of day were the Android tweets made by Trump himself? (e.g. morning, late night)

Late night

* At what time of day were the Android tweets made by paid staff?

Morning

Note that these are speculations based on what you observe in the data set.

### Task 3.7 Device Analysis
Let's now look at which device he has used over the entire time period of this dataset.

To examine the distribution of dates we will convert the date to a fractional year that can be plotted as a distribution.

(Code borrowed from https://stackoverflow.com/questions/6451655/python-how-to-convert-datetime-dates-to-decimal-years)

In [None]:
import datetime
def year_fraction(date):
    start = datetime.date(date.year, 1, 1).toordinal()
    year_length = datetime.date(date.year+1, 1, 1).toordinal() - start
    return date.year + float(date.toordinal() - start) / year_length


df_trump['year'] = df_trump['time'].apply(year_fraction)

Use the `sns.distplot` to overlay the distributions of the 2 most frequently used web technologies over the years.  Your final plot should be similar to:

![source_years.png](attachment:source_years.png)

In [None]:
### BEGIN SOLUTION
#plt.figure(figsize=(15,15))
twitter = df_trump.loc[df_trump['source'] == 'Twitter for iPhone']
twitter1 = df_trump.loc[df_trump['source'] == 'Twitter for Android']
ax = sns.distplot(twitter[['year']], hist = True, label='iPhone')
ax = sns.distplot(twitter1[['year']],hist=True, label='Android')
ax.set(xlabel='Year',ylabel='Fraction')
ax.legend(loc='upper right', frameon=True)

### END SOLUTION

## PART 4 - Sentiment Analysis

It turns out that we can use the words in Trump's tweets to calculate a measure of the sentiment of the tweet. For example, the sentence "I love America!" has positive sentiment, whereas the sentence "I hate taxes!" has a negative sentiment. In addition, some words have stronger positive / negative sentiment than others: "I love America." is more positive than "I like America."

We will use the [VADER (Valence Aware Dictionary and sEntiment Reasoner)](https://github.com/cjhutto/vaderSentiment) lexicon to analyze the sentiment of Trump's tweets. VADER is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media which is great for our usage.

The VADER lexicon gives the sentiment of individual words. Run the following cell to show the first few rows of the lexicon:

In [None]:
print(''.join(open("data/vader_lexicon.txt").readlines()[:10]))

### Task 4.1

As you can see, the lexicon contains emojis too! The first column of the lexicon is the *token*, or the word itself. The second column is the *polarity* of the word, or how positive / negative it is.

(How did they decide the polarities of these words? What are the other two columns in the lexicon? See the link above.)

 Read in the lexicon into a DataFrame called `df_sent`. The index of the DF should be the tokens in the lexicon. `df_sent` should have one column: `polarity`: The polarity of each token.
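
To make the file layout concrete, here is a plain-Python sketch that reads the first two tab-separated fields of each line. The helper name `parse_lexicon` is hypothetical and the two sample rows are made up; they only imitate the four-column layout.

```python
def parse_lexicon(lines):
    """Map each token to its polarity, reading the first two tab-separated fields."""
    polarity = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        polarity[fields[0]] = float(fields[1])
    return polarity

# Made-up rows: token, polarity, and two further columns that are ignored.
sample = ["good\t1.9\t0.9\t[2, 1, 3]\n",
          "awful\t-2.0\t1.2\t[-2, -3, -1]\n"]
lexicon = parse_lexicon(sample)
print(lexicon)  # {'good': 1.9, 'awful': -2.0}
```

In pandas, the same selection can be expressed through the `sep`, `header`, `usecols`, `names`, and `index_col` arguments of `pd.read_csv`.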

In [None]:
### BEGIN SOLUTION
# Keep only the token and polarity columns and index the table by token.
df_sent = pd.read_csv('data/vader_lexicon.txt', sep='\t', header=None,
                      usecols=[0, 1], names=['token', 'polarity'],
                      index_col='token')
df_sent
### END SOLUTION

In [None]:
#Citations
"""
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

https://pynative.com/pandas-set-index/

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Index.set_names.html

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename_axis.html
"""

### Task 4.2

Now, let's use this lexicon to calculate the overall sentiment for each of Trump's tweets. Here's the basic idea:

1. For each tweet, find the sentiment of each word.
2. Calculate the sentiment of each tweet by taking the sum of the sentiments of its words.
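
The two steps can be sketched in plain Python before doing them with pandas. The three-word lexicon below is made up for illustration, and `tweet_polarity` is a hypothetical helper.

```python
# A made-up three-word lexicon; the real values come from vader_lexicon.txt.
lexicon = {"love": 3.0, "like": 1.5, "hate": -2.5}

def tweet_polarity(text, lexicon):
    """Sum the polarity of every word; words not in the lexicon count as 0."""
    return sum(lexicon.get(word, 0.0) for word in text.split())

print(tweet_polarity("i love america", lexicon))  # 3.0
print(tweet_polarity("i hate taxes", lexicon))    # -2.5
print(tweet_polarity("hello world", lexicon))     # 0.0
```

The last call shows why a tweet with no lexicon words should end up with polarity 0 rather than being dropped.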

First, let's lowercase the text in the tweets since the lexicon is also lowercase. Set the `text` column of the `df_trump` DF to be the lowercased text of each tweet.

In [None]:
### BEGIN SOLUTION
df_trump['text'] = df_trump['text'].str.lower()
### END SOLUTION

### Task 4.3

Now, let's get rid of punctuation since it'll cause us to fail to match words. Create a new column called `no_punc` in the `df_trump` to be the lowercased text of each tweet with all punctuation replaced by a single space. We consider punctuation characters to be any character that isn't a Unicode word character or a whitespace character. You may want to consult the Python documentation on regexes for this problem.

(Why don't we simply remove punctuation instead of replacing with a space? See if you can figure this out by looking at the tweet data.)
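
A small experiment makes the difference visible. The tweet text below is made up, and the pattern is one possible regex matching the definition above.

```python
import re

punct_pattern = r"[^\w\s]"  # anything that is neither a word character nor whitespace

text = "great rally!#maga"  # made-up tweet text
removed = re.sub(punct_pattern, "", text)
spaced = re.sub(punct_pattern, " ", text)
print(removed.split())  # ['great', 'rallymaga']
print(spaced.split())   # ['great', 'rally', 'maga']
```

Deleting the punctuation glues neighbouring words together, so neither of them could be matched against the lexicon afterwards.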

In [None]:
#Citations: https://pythonexamples.org/python-re-sub/
#https://www.pythondaddy.com/python/how-to-remove-punctuation-from-a-dataframe-in-pandas-and-python/

In [None]:
# Save your regex in punct_re
punct_re = r'[^\w\s]'


### BEGIN SOLUTION
# Replace every character that is neither a word character nor whitespace.
df_trump['no_punc'] = df_trump['text'].str.replace(punct_re, ' ', regex=True)
df_trump
### END SOLUTION


In [None]:
assert isinstance(punct_re, str)
assert re.search(punct_re, 'this') is None
assert re.search(punct_re, 'this is ok') is None
assert re.search(punct_re, 'this is\nok') is None
assert re.search(punct_re, 'this is not ok.') is not None
assert re.search(punct_re, 'this#is#ok') is not None
assert re.search(punct_re, 'this^is ok') is not None
assert df_trump['no_punc'].loc[800329364986626048] == 'i watched parts of  nbcsnl saturday night live last night  it is a totally one sided  biased show   nothing funny at all  equal time for us '
assert df_trump['text'].loc[884740553040175104] == 'working hard to get the olympics for the united states (l.a.). stay tuned!'


### Task 4.4


Now, let's convert the tweets into what's called a [*tidy format*](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html) to make the sentiments easier to calculate. Use the `no_punc` column of `df_trump` to create a table called `tidy_format`. The index of the table should be the IDs of the tweets, repeated once for every word in the tweet. It has two columns:

1. `num`: The location of the word in the tweet. For example, if the tweet was "i love america", then the location of the word "i" is 0, "love" is 1, and "america" is 2.
2. `word`: The individual words of each tweet.

The first few rows of our `tidy_format` table look like:

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>num</th>
      <th>word</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>894661651760377856</th>
      <td>0</td>
      <td>i</td>
    </tr>
    <tr>
      <th>894661651760377856</th>
      <td>1</td>
      <td>think</td>
    </tr>
    <tr>
      <th>894661651760377856</th>
      <td>2</td>
      <td>senator</td>
    </tr>
    <tr>
      <th>894661651760377856</th>
      <td>3</td>
      <td>blumenthal</td>
    </tr>
    <tr>
      <th>894661651760377856</th>
      <td>4</td>
      <td>should</td>
    </tr>
  </tbody>
</table>

You can double check that your tweet with ID `894661651760377856` has the same rows as ours. Our tests don't check whether your table looks exactly like ours.

As usual, try to avoid using any for loops. Our solution uses a chain of 5 methods on the `df_trump` DF, albeit using some rather advanced Pandas hacking.

* **Hint 1:** Try looking at the `expand` argument to pandas' `str.split`.

* **Hint 2:** Try looking at the `stack()` method.

* **Hint 3:** Try looking at the `level` parameter of the `reset_index` method.
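
To pin down what the table contains, here is a loop-based sketch in plain Python. It is not the vectorized pandas solution the task asks for; `to_tidy_rows` is a hypothetical helper and the IDs and texts are made up.

```python
def to_tidy_rows(texts_by_id):
    """Return (tweet_id, num, word) triples, one per word of each tweet."""
    rows = []
    for tweet_id, text in texts_by_id.items():
        # enumerate gives each word its position within the tweet.
        for num, word in enumerate(text.split()):
            rows.append((tweet_id, num, word))
    return rows

rows = to_tidy_rows({101: "i love america", 102: "sad"})  # made-up IDs and text
for row in rows:
    print(row)
```

Each tweet ID appears once per word, which is exactly the repetition of the index described above.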

In [None]:
# Use the tweet ID as the index (set it here if `id` is still a column).
tweets_by_id = df_trump.set_index('id') if 'id' in df_trump.columns else df_trump
tidy_format = (
    tweets_by_id['no_punc']
    .str.split(expand=True)   # one column per word position
    .stack()                  # one row per (tweet, position)
    .dropna()                 # drop the padding for shorter tweets
    .reset_index(level=1)     # move the word position into a column
    .rename(columns={'level_1': 'num', 0: 'word'})
)
tidy_format

In [None]:
#Citations: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.split.html
#https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html
#https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iterrows.html
#https://www.statology.org/pandas-merge-on-index/

In [None]:
assert tidy_format.loc[894661651760377856].shape == (27, 2)
assert ' '.join(list(tidy_format.loc[894661651760377856]['word'])) == 'i think senator blumenthal should take a nice long vacation in vietnam where he lied about his service so he can at least say he was there'

### Task 4.5

Now that we have this table in the tidy format, it becomes much easier to find the sentiment of each tweet: we can join the table with the lexicon table. 

Add a `polarity` column to the `df_trump` table.  The `polarity` column should contain the sum of the sentiment polarity of each word in the text of the tweet.

**Hint** you will need to merge the `tidy_format` and `df_sent` tables and group the final answer.


In [None]:
#df_trump['polarity'] = ...

### BEGIN SOLUTION
# Look up each word's polarity (NaN when the word is not in the lexicon),
# then add the values up per tweet ID. Assumes df_trump is indexed by tweet ID.
word_polarity = tidy_format.merge(df_sent, how='left',
                                  left_on='word', right_index=True)
df_trump['polarity'] = (
    word_polarity.groupby(level=0)['polarity']
                 .sum()                                  # NaN counts as 0
                 .reindex(df_trump.index, fill_value=0)  # keep tweets with no words
)
### END SOLUTION

In [None]:
assert np.allclose(df_trump.loc[744701872456536064, 'polarity'], 8.4)
assert np.allclose(df_trump.loc[745304731346702336, 'polarity'], 2.5)
assert np.allclose(df_trump.loc[744519497764184064, 'polarity'], 1.7)
assert np.allclose(df_trump.loc[894661651760377856, 'polarity'], 0.2)
assert np.allclose(df_trump.loc[894620077634592769, 'polarity'], 5.4)
# If you fail this test, you dropped tweets with 0 polarity
assert np.allclose(df_trump.loc[744355251365511169, 'polarity'], 0.0)


### Task 4.6
Now we have a measure of the sentiment of each of his tweets! You can read over the VADER readme to learn about a more robust sentiment analysis.
Now, write the code to find the most positive and most negative tweets from Trump in your dataset.

In [None]:
### BEGIN SOLUTION

print('Most negative tweets:')
# Sort by ascending polarity so the most negative tweets come first.
for text in df_trump.sort_values('polarity').head(2)['text']:
    print(text)
### END SOLUTION

In [None]:
### BEGIN SOLUTION

print('Most positive tweets:')
# Sort by descending polarity so the most positive tweets come first.
for text in df_trump.sort_values('polarity', ascending=False).head(2)['text']:
    print(text)

### END SOLUTION

### Task 4.7
Plot the distribution of tweet sentiments broken down by whether the text of the tweet contains `nyt` or `fox`.  Then, in the box below, comment on what you observe.

![title](images/nyt_vs_fox.png)

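### BEGIN SOLUTION
# Boolean masks for tweets that mention each outlet (text is already lowercase).
has_nyt = df_trump['text'].str.contains('nyt', na=False)
has_fox = df_trump['text'].str.contains('fox', na=False)
ax = sns.distplot(df_trump.loc[has_nyt, 'polarity'], label='nyt')
ax = sns.distplot(df_trump.loc[has_fox, 'polarity'], label='fox')
ax.legend()
### END SOLUTION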

##### Comment on what you observe:

#### BEGIN SOLUTION
Generally, tweets containing "fox" have more positive sentiment than those containing "nyt". The spread (in terms of x range) is wider for NYT, but the sentiment values are generally higher for Fox.

#### END SOLUTION

## PART 5 - Principal Component Analysis (PCA) and Twitter
A look at the top words used and the sentiments expressed in Trump's tweets indicates that some words are used together almost all the time. A notable example is a slogan like Make America Great Again. As such, it may be beneficial to look at groups of words rather than individual words. For that, we will apply Principal Component Analysis.

### The PCA
Principal Component Analysis, or PCA, is a tool generally used to identify patterns and to reduce the number of variables you have to consider in your analysis. For example, if you have data with 200 columns, it may be that a significant amount of the variance in your data can be explained by just 100 principal components. In PCA, the first component is chosen so that it has the largest possible variance; each subsequent component is orthogonal to the previous ones and captures as much of the remaining variance as possible. In this way, PCA captures as much of the variability in the data set as possible with the first few components. Mathematically, each component is a linear combination of all the input parameters, weighted by coefficients specific to that component. These coefficients, or loading factors, are constrained such that the sum of their squares is equal to 1. As such, the loading factors serve as weights describing how strongly each parameter contributes to that principal component. Parameters with large positive or negative loading factors on the same component tend to vary together, which can help identify trends in your data.
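
The properties described above can be checked on a small example. The following is a minimal sketch, assuming synthetic random data (not the tweet data) and computing the components with NumPy's SVD rather than scikit-learn. The variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 200 observations, 5 features; the first two features are strongly correlated
X = rng.normal(size=(200, 5))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=200)

# Center the data, then take the SVD: the rows of Vt are the principal components
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

loadings = Vt                                # shape (n_components, n_features)
explained_variance = S**2 / (len(X) - 1)     # variance captured by each component
scores = Xc @ Vt.T                           # each observation in PC coordinates

print(np.sum(loadings**2, axis=1))           # squared loadings of each component sum to 1
print(np.round(loadings @ loadings.T, 6))    # components are mutually orthogonal
print(explained_variance / explained_variance.sum())  # share of variance, largest first
```

Because two of the five features move together, the first component should carry a noticeably larger share of the variance than the others.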

### Task 5.1 Cleaning up the Data
Using the NLTK (Natural Language Toolkit) package for language processing and other Python libraries, parse the JSON file to handle inflected words, such as plurals, and remove stop words such as common English words (the, and, it, etc.) and certain political terms (the candidates' names, for example). You can start with the top 50 words, but a full analysis may require a larger number of words.
Create a document-frequency (df) matrix with 5000 rows and 50 columns, where each column is a particular word (feature) and each row is a tweet (observation). Each value in the matrix is how often the word appears in the tweet. Apply the techniques we learned to reduce the weight of the most common words (if necessary). Since this is a sparse matrix, you can use the sparse matrix libraries to make things a bit more efficient (a regular NumPy array also works, since the dimensions are not too large). Lecture 6.1 covers some sparse matrix routines you can use.
Print the first 10 rows of the df matrix to show what you created.

Start with the `tidy_format` dataframe
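
As a rough illustration of the matrix described above, here is a minimal sketch on three made-up tweets and a three-word vocabulary. The down-weighting step uses inverse document frequency (IDF), which is only an assumption about which technique the task has in mind. All names and data here are hypothetical.

```python
import numpy as np

# Hypothetical toy tweets, already lowercased and stripped of punctuation
tweets = [
    "make america great again",
    "great crowd great night",
    "fake news again",
]
vocab = ["great", "again", "news"]           # stand-in for the top-50 word list
col = {w: j for j, w in enumerate(vocab)}

# counts[i, j] = number of times vocab[j] appears in tweet i
counts = np.zeros((len(tweets), len(vocab)))
for i, tweet in enumerate(tweets):
    for word in tweet.split():
        if word in col:
            counts[i, col[word]] += 1

# Inverse document frequency: words found in many tweets get a smaller weight
doc_freq = (counts > 0).sum(axis=0)
idf = np.log(len(tweets) / doc_freq)
weighted = counts * idf

print(counts)
print(weighted.round(3))
```

Most entries are zero even in this tiny example, which is why a sparse representation becomes attractive as the vocabulary grows.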

In [None]:
### BEGIN SOLUTION
df_merged_temp = pd.merge(df_trump, tidy_format, left_index = True, right_index = True)
df_merged_temp = pd.merge(df_merged_temp, df_sent, left_on = 'word', right_index = True)
## Code to print the first 10 rows of the matrix
import nltk
import nltk.corpus
from nltk.stem.wordnet import WordNetLemmatizer
nltk.download('stopwords')
nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()

def plural(s):
    # Reduce an inflected word (e.g. a plural) to its dictionary form
    return lemmatizer.lemmatize(s)

# Use tmp to hold every word that appears in the tweets

tmp = tidy_format

#remove stopwords
extra_stop_words = ['amp', 'trump', 'hillary', 'realdonaldtrump', 'clinton', 'trump2016']
stop_words = list(nltk.corpus.stopwords.words("english"))
stop_words.extend(extra_stop_words)
tmp = tmp['word'][~tmp['word'].isin(stop_words)]

#deal with plurals

tmp = tmp.apply(lambda s:plural(s))

# Remove numbers

tmp = tmp[~tmp.str.isnumeric()]

#Remove words with only 1 or 2 length

tmp = tmp[tmp.apply(lambda x: len(x) > 2)]

# value_counts sorts in descending order, so the first 50 entries are the top 50 words
top_50_words = tmp.value_counts().index[:50].to_list()


# Map each top word to its column index in the matrix
w_to_idx = {w: i for i, w in enumerate(top_50_words)}

matrix = np.zeros((5000, 50))

# One row per tweet: count how often each top-50 word occurs in the tweet text
for i, text in enumerate(df_trump['text'].iloc[:5000]):
    for j in text.lower().split():
        if j in w_to_idx:
            matrix[i, w_to_idx[j]] += 1
print(matrix[:10])
### END SOLUTION

### Task 5.2 Find the Principal Components
Write the code to find the first 50 principal components of the document-frequency matrix. Pass the matrix to scikit-learn's [PCA](https://scikit-learn.org/stable/modules/decomposition.html#decompositions) class to obtain the components and loading factors.

In [None]:
### BEGIN SOLUTION
from sklearn.decomposition import PCA

pca = PCA(n_components=50)
pca.fit(matrix)

# Each row of components_ holds the loading factors of one principal component
print(pca.components_)
print(pca.explained_variance_)


### END SOLUTION

### Task 5.3 Examine the PCA
We can examine the PCA results with a heatmap. Make a grid plot that shows the principal components along the x-axis and the individual words along the y-axis. Each grid box should be color-coded based on the sign of the loading factor and the size of its square. Looking at it vertically, you can see which words constitute each principal component. Looking at it horizontally, you can see how individual terms are shared between components.

![title](images/pca.png)
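
The color value for each grid box (sign of the loading, magnitude of its square) can be computed directly. This is a minimal sketch on a made-up 3-by-3 loading matrix; the numbers and the name `signed_sq` are hypothetical.

```python
import numpy as np

# Hypothetical loading matrix: rows are words, columns are principal components
words = ["great", "again", "news"]
loadings = np.array([
    [ 0.6, -0.8,  0.0],
    [ 0.8,  0.6,  0.0],
    [ 0.0,  0.0,  1.0],
])

# Keep the sign of each loading, but use its square as the magnitude
signed_sq = np.sign(loadings) * loadings**2

print(signed_sq)
print(np.abs(signed_sq).sum(axis=0))  # each column sums to 1 since loadings have unit length
```

An array like `signed_sq` could then be passed to a heatmap with a diverging colormap centered at zero, so that positive and negative loadings get different hues.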



In [None]:
### BEGIN SOLUTION
fig, ax = plt.subplots(figsize=(22, 18))
cmap = sns.diverging_palette(100, 400, as_cmap=True)
# components_ has one row per component; transpose so rows are words and columns are PCs
pca_components = pca.components_.T
ax = sns.heatmap(pca_components, ax=ax, center=0, cmap=cmap, yticklabels=top_50_words,
                 xticklabels=['PC' + str(x) for x in range(1, 51)])


### END SOLUTION

### Task 5.4 PCA Compare
We can determine how many words and how many components are needed for a good visualization. Plot PC1 and PC2 in a 2D plot. The results should be similar to the following scatter plot:

![title](images/PC1_PC2.png)

This is a scatter plot of the values of the components, with arrows indicating some of the prominent terms as identified by their loading factors. The loading factors determine the length and direction of each arrow. That is, tweets that use these terms are moved along the direction of those arrows. Only the most important parameters are shown.
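
To make the role of the arrows concrete, here is a minimal sketch with made-up loadings for four words on PC1 and PC2. It picks the words with the longest arrows and shows how a tweet's position follows from the words it contains. All values are hypothetical.

```python
import numpy as np

words = ["great", "again", "news", "crowd"]
# Hypothetical loadings of each word on PC1 and PC2 (one row per word)
loadings = np.array([
    [ 0.70,  0.10],
    [ 0.65, -0.20],
    [-0.05,  0.90],
    [ 0.10,  0.05],
])

# Arrow length of each word in the PC1-PC2 plane
lengths = np.hypot(loadings[:, 0], loadings[:, 1])

# Keep the three words with the longest arrows
top = np.argsort(lengths)[::-1][:3]
for i in top:
    print(f"{words[i]}: arrow from (0, 0) to ({loadings[i, 0]:.2f}, {loadings[i, 1]:.2f})")

# Up to a constant shift from centering, a tweet's position is the
# count-weighted sum of the arrows of the words it contains
counts = np.array([1, 1, 0, 0])   # a tweet with "great" and "again" once each
position = counts @ loadings
print(position)
```

Words with short arrows, such as "crowd" here, barely move a tweet in this plane, which is why only the most prominent terms are usually drawn.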

In [None]:
### BEGIN SOLUTION
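# Project each tweet onto the principal components to get its PC1/PC2 values
scores = pca.transform(matrix)
pc1, pc2 = scores[:, 0], scores[:, 1]

g = sns.JointGrid(x=pc1, y=pc2)
g.plot(sns.scatterplot, sns.histplot)
g.set_axis_labels('PC1', 'PC2')

# Draw arrows for the five words with the largest loadings on PC1 and PC2
loadings = pca.components_[:2].T
for idx in np.argsort((loadings ** 2).sum(axis=1))[-5:]:
    g.ax_joint.arrow(0, 0, loadings[idx, 0], loadings[idx, 1], color='r', head_width=0.05)
    g.ax_joint.annotate(top_50_words[idx], loadings[idx])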
### END SOLUTION

## PART 6 - Twitter Engagement

In this problem, we'll explore which words led to a greater average number of retweets. For example, at the time of this writing, Donald Trump has two tweets that contain the word 'oakland' (tweets 932570628451954688 and 1016609920031117312) with 36757 and 10286 retweets respectively, for an average of 23,521.5.
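
The 'oakland' average can be reproduced with a short pandas sketch. The two 'oakland' retweet counts are the ones quoted above; the rows for 'great' are made-up values added only so there is something to compare against.

```python
import pandas as pd

# One row per (tweet, word) pair; the 'great' rows are hypothetical
pairs = pd.DataFrame({
    "word": ["oakland", "oakland", "great", "great", "great"],
    "retweet_count": [36757, 10286, 12000, 9000, 15000],
})

# Average retweet count per word, highest first
avg = pairs.groupby("word")["retweet_count"].mean().sort_values(ascending=False)
print(avg)
```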


Your `top_20` table should have this format:

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>retweet_count</th>
    </tr>
    <tr>
      <th>word</th>
      <th></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>jong</th>
      <td>40675.666667</td>
    </tr>
    <tr>
      <th>try</th>
      <td>33937.800000</td>
    </tr>
    <tr>
      <th>kim</th>
      <td>32849.595745</td>
    </tr>
    <tr>
      <th>un</th>
      <td>32741.731707</td>
    </tr>
    <tr>
      <th>maybe</th>
      <td>30473.192308</td>
    </tr>
  </tbody>
</table>

### Task 6.1
Find the top 20 most retweeted words. Include only words that appear in at least 25 tweets. As usual, try to do this without any for loops. You can string together ~5-7 pandas commands and get everything done on one line.

In [None]:
#top_20 = ...
### BEGIN SOLUTION
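top_20 = (tidy_format[['word']].join(df_trump['retweet_count'])
          .reset_index().drop_duplicates()   # count each word once per tweet
          .groupby('word').filter(lambda g: len(g) >= 25)  # words in at least 25 tweets
          .groupby('word')[['retweet_count']].mean()
          .sort_values('retweet_count', ascending=False)
          .head(20))
top_20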
### END SOLUTION

In [None]:
#Citations: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html
#https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.nlargest.html

### Task 6.2
Plot a bar chart of your results:

In [None]:
### BEGIN SOLUTION
results = sns.barplot(y=top_20.index, x='retweet_count', data=top_20)
results
### END SOLUTION

## PART 7 - Conclusion (Optional for Individual)
What else can we do? Let us ask some open-ended questions.

### Task 7.1
"kim", "jong" and "un" are apparently really popular in Trump's tweets! It seems like we can conclude that his tweets involving jong are more popular than his other tweets. Or can we?

Consider each of the statements about possible confounding factors below. State whether each statement is true or false and explain. If the statement is true, state whether the confounding factor could have made kim jong un related tweets higher in the list than they should be.

1. We didn't restrict our word list to nouns, so we have unhelpful words like "let" and "any" in our result.
      - That might be why 'un' is so popular.
1. We didn't remove hashtags in our text, so we have duplicate words (e.g. #great and great).
      - Some tweets may contain only '#great' and not 'great', which makes the average lower.
1. We didn't account for the fact that Trump's follower count has increased over time.
      - This can have a large effect. As Trump's follower count increases, every word he uses tends to get more retweets.

In [None]:
#### BEGIN SOLUTION
print('We can check whether each of these factors holds by using rt_re, a regex pattern that identifies retweets, and hash_link_re, a regex pattern that identifies tweets with hashtags or links.')
#### END SOLUTION

### Task 7.2
Using the `df_trump` tweets construct an interesting plot describing a property of the data and discuss what you found below.

**Ideas:**

1. How does sentiment change with the length of the tweets?
1. Does sentiment affect retweet count?
1. Are retweets more negative than regular tweets?
1. Are there any spikes in the number of retweets, and do they correspond to world events?
1. What terms have an especially positive or negative sentiment?

You can look at other data sources and even tweets. Do some plots and discuss. You can add more cells here as needed.


In [None]:
#### BEGIN SOLUTION

# In this cell, my partner and I created a single plot showing the distribution of tweet sentiments
# for tweets containing "nytimes" alongside the distribution for tweets containing "fox"
sns.histplot(df_trump[df_trump['text'].str.lower().str.contains("nytimes")]['polarity'], kde=True, stat='density', label='nytimes')
sns.histplot(df_trump[df_trump['text'].str.lower().str.contains("fox")]['polarity'], kde=True, stat='density', label='fox')
plt.title('Distributions of Tweet Polarities (nytimes vs. fox)')
plt.legend()

#### END SOLUTION


#### BEGIN SOLUTION
Discussion: "Enter question you tried answering"

Answer: 
Byung-Chul Han answered parts 2, 4, 6, and Varun Kumar answered parts 3, 5, 7
#### END SOLUTION

### Group Part - Find Something interesting (Optional for Individuals)
Is there still something interesting to find in this data set? Use your own imagination to ask some good questions. Don't be biased; look for the answer in the data. Don't ask us what we want, because we do not know either. This will be for EXTRA CREDIT for individuals but part of the regular assignment for groups. Add any cells below.



The most interesting thing we found in the data set is the visible amount of bias, and the level of control one can have over implicit tone and sentiment, even within tweets. For someone of the president's stature, status, and power, the ability to influence the opinions of the masses through something as simple as a one-sentence tweet comes with a large level of responsibility. For example, the difference in sentiment between tweets mentioning Fox News and tweets mentioning The New York Times suggests how the president's own personal and political biases can be used to sway public opinion.

Additionally, some good questions to ask regarding this project would include:
1) How did certain tweets influence public opinion of the president? That is, was there a causal relationship between tweets, their respective polarities, and the president's approval rating?
2) How effective was tweeting during election periods (e.g. presidential and midterm elections) as well as during crises (COVID-19, etc.)?

<div class="alert alert-block alert-info">
<h2>Submission Instructions</h2> 
<b> File Name:</b> Please name the file as yourSection_yourNetID_midsemester.ipynb<br>
<b> Group Projects:</b> Each person in the group must submit a copy with both names listed. If you are doing a group project, you must inform your TA prior to 11/3/21 that you intend to work as a group and submit your name and your partner's name. We will <b>not accept group work</b> if your TA has not been notified.<br>
<b> Submit To: </b> Canvas &rarr; Assignments &rarr; midsemester (remove all output; do not submit data files)<br>
<b>Warning:</b> Failure to follow directions may result in loss of points.<br>
</div>

Created by Andy Guna @2019-2021 Credits: Josh Hug, and Berkeley Data Science Group, Steve Skiena, David Rodreguez