### Resampling data and plotting it with pandas and Matplotlib 

We did our first data analysis with a large data set and saw how we could answer a research question based on a simple categorization. Although it produced good results, that sort of analysis is limited: it looks at the data at only one point in time. Analyzing data across time, on the other hand, allows us to look for trends and better understand the anomalies we encounter. By exploring the changes in data and isolating specific events, we can make meaningful connections between them. 

In 2017 and 2018, Twitter, Facebook, and Google were heavily criticized for allowing international agents to spread false or misleading content meant to influence public opinion in the US and abroad. This public scrutiny ultimately led to the publication of two major data bundles: one of Russian tweets that, according to Twitter, Congress, and various media reports, were used to manipulate the US media landscape, and another of Iranian tweets doing the same.

Our research question is straightforward: How many tweets related to Trump and Clinton were tweeted by Iranian actors over time? We’ll define Trump- and Clinton-related tweets as tweets that use hashtags containing the string `trump` or `clinton` (ignoring case).  

Here are the steps we need to take:
- Filter our data using a lambda function
- Format a data column as a datetime
- Resample the data over time with the `resample` function
- Plot the data with Matplotlib


Let's start by importing `pandas` and Matplotlib's `pyplot` module (`import matplotlib.pyplot as plt`):

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

Read in the CSV file with the Iranian tweets. You can download it from the link below. Make sure you place it in a `data` folder inside your project folder: [https://archive.org/details/iranian_tweets_csv_hashed/](https://archive.org/details/iranian_tweets_csv_hashed/)

In [None]:
%%time
tweets = pd.read_csv('../data/iranian_tweets_csv_hashed.csv')

In [None]:
tweets.head()

In [None]:
tweets.dtypes

### Filter our data using a lambda function

We want a dataframe that contains only tweets relating to the 2016 presidential candidates. We’ll use a simple heuristic for this: include only tweets whose hashtags include the strings `trump`, `clinton`, or both. While this may not catch every tweet about Donald Trump or Hillary Clinton, it’s a clear-cut and easily understandable way to look at the activities of these misinformation agents.

For that we will:
- look for tweets that used hashtags containing the substrings `trump` or `clinton`
- filter down our data based on that condition

In [None]:
tweets['hashtags'].value_counts()

In [None]:
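# Flag tweets whose hashtags contain 'trump' or 'clinton', ignoring case.
# str(x) turns missing values into the string 'nan', which matches neither.
tweets['includes_trump_or_clinton'] = tweets['hashtags'].apply(lambda x: 'clinton' in str(x).lower() or 'trump' in str(x).lower())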

In [None]:
# The column is already boolean, so it can be used directly as a mask
tweets[tweets['includes_trump_or_clinton']].head(100)

In [None]:
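# .copy() makes the subset an independent dataframe, so changing its columns
# later does not raise a SettingWithCopyWarning
tweets_subset = tweets[tweets['includes_trump_or_clinton']].copy()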
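As an alternative to the lambda function, the same filter can be sketched with pandas' vectorized string methods. The toy dataframe below is invented for illustration and only mimics the `tweetid` and `hashtags` columns. The names `toy`, `mask`, and `toy_subset` are hypothetical.

```python
import pandas as pd

# Toy stand-in for the real data; these rows are invented for illustration.
toy = pd.DataFrame({
    "tweetid": [1, 2, 3, 4],
    "hashtags": ["[Trump, MAGA]", "[Iran]", None, "[HillaryClinton]"],
})

# str.contains accepts a regular expression: "trump|clinton" matches either
# substring. case=False ignores case, and na=False treats missing hashtags
# as non-matches.
mask = toy["hashtags"].str.contains("trump|clinton", case=False, na=False)

# .copy() returns an independent dataframe rather than a view of `toy`.
toy_subset = toy[mask].copy()
print(toy_subset["tweetid"].tolist())
```

Both approaches should select the same rows. The lambda version is easier to read one step at a time, while `str.contains` avoids calling a Python function once per row.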

### Resampling our data over time

To get a tally of these tweets over time we need to:
- format a column as a datetime
- set the index of our dataframe to this datetime column
- resample our dataframe 

In [None]:
tweets_subset.dtypes

In [None]:
# Parse the timestamp strings into pandas datetime values
tweets_subset['tweet_time'] = pd.to_datetime(tweets_subset['tweet_time'])

In [None]:
tweets_subset.dtypes

In [None]:
tweets_over_time = tweets_subset.set_index('tweet_time')
tweets_over_time.head()

In [None]:
# 'M' bins rows by calendar month; pandas 2.2 and later spell this alias 'ME'
tweet_tally = tweets_over_time.resample('M').count()
tweet_tally.head()

In [None]:
monthly_tweet_count = tweet_tally['tweetid']
monthly_tweet_count.head()
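To see what these steps do on a small scale, here is a minimal sketch on four invented timestamps. The dataframe `toy` and the result `toy_monthly` are hypothetical names. The sketch uses the `'MS'` (month start) alias, which labels each bin with the first day of the month, rather than the month-end alias used above.

```python
import pandas as pd

# Four invented tweets: two in January 2016, none in February, two in March.
toy = pd.DataFrame({
    "tweetid": [1, 2, 3, 4],
    "tweet_time": ["2016-01-05 10:00", "2016-01-20 18:30",
                   "2016-03-02 09:15", "2016-03-28 23:59"],
})

# Step 1: convert the text column to datetime values.
toy["tweet_time"] = pd.to_datetime(toy["tweet_time"])

# Step 2: make the datetime column the index.
# Step 3: group rows into monthly bins and count the tweet IDs in each bin.
toy_monthly = toy.set_index("tweet_time").resample("MS")["tweetid"].count()
print(toy_monthly)
```

Note that February appears in the result with a count of 0. `resample` creates a bin for every month in the range, which keeps gaps visible when the counts are plotted.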

### Plot your data 

Now it's time to plot your data to better understand trends over time!

In [None]:
plt.plot(monthly_tweet_count)
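A bare line chart is enough for exploration, but a title and axis labels make it easier to share. The sketch below shows one way to add them, using invented monthly counts in place of `monthly_tweet_count`. The name `toy_counts` and the label text are assumptions.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Invented monthly counts, standing in for monthly_tweet_count.
toy_counts = pd.Series(
    [3, 0, 7, 12],
    index=pd.date_range("2016-01-01", periods=4, freq="MS"),
)

# Create a figure and axes explicitly so they can be labeled.
fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(toy_counts.index, toy_counts.values)
ax.set_title("Tweets with Trump or Clinton hashtags, by month")
ax.set_xlabel("Month")
ax.set_ylabel("Number of tweets")

# Rotate the date labels so they do not overlap.
fig.autofmt_xdate()
```

In a notebook the figure is displayed when the cell finishes. In a plain Python script you would call `plt.show()` or `fig.savefig(...)` afterwards.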

In [None]:
len(tweets['tweetid'])

In [None]:
len(tweets)