Pandas Jupyter Experiment

For this exercise, you will be choosing a dataset that is already available in CSV, using either the links suggested in the module or an external source. Note that the completeness and organization of the CSV will impact your success, so you might want to investigate it using our initial methods before you commit to it and your object of analysis.
I have provided the headings for each section: you can modify them to reflect your final workflow. Here's what you need to accomplish and document:
Complete the five sequential stages of importing, analyzing, and visualizing your CSV data using the Pandas library. The headers are provided, but you will need to plan out and structure what happens in the code using a combination of our class exercise and the textbook for guidance.

Create well-structured, readable documentation for every cell of your Python code. Use Markdown (as demonstrated in this example) and preview the results on GitHub to confirm it renders as intended.

As a bonus exercise, output and save a meaningful, formatted visualization, following the examples in the textbook.
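
Before committing to a dataset, a quick look at its column names and missing values can save time later. The following is a minimal sketch rather than a required step; the inline sample data and the `preview_csv` helper are hypothetical stand-ins for your own file.

```python
import io

import pandas as pd


def preview_csv(source, rows=5):
    """Load only the first few rows and report basic facts about them."""
    sample = pd.read_csv(source, nrows=rows)
    return {
        "columns": list(sample.columns),
        "rows_read": len(sample),
        # Share of missing values per column, within the rows that were read
        "missing_share": sample.isna().mean().to_dict(),
    }


# Hypothetical stand-in for a real file; pass your own file name instead
sample_csv = io.StringIO(
    "from_user,text,time\n"
    "alice,hello,2021-03-18 10:05:00\n"
    "bob,,2021-03-18 10:15:00\n"
)
report = preview_csv(sample_csv)
print(report)
```

A column with a high share of missing values, or a file whose column names are not what you expected, is a sign to pick a different dataset or plan extra cleaning.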

Stage One: Import Libraries / Data
First, we import all necessary libraries. To adjust this workflow, change the contents of the variable file_name below to the name of your CSV file. Make sure the file is co-located (stored in the same folder) with your notebook to use the formatting as written.


In [6]:
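# Import the pandas library for loading and analyzing tabular data
import pandas as pd

# Name of the CSV file, stored in the same folder as this notebook
file_name = 'snyder_cut.csv'

# Load the CSV into a DataFrame (a comma is already the default delimiter)
df = pd.read_csv(file_name, delimiter=",")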

FileNotFoundError: [Errno 2] No such file or directory: 'snyder_cut.csv'
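
The `FileNotFoundError` above means pandas could not find the CSV relative to the notebook's working directory. One possible way to get a more helpful message, sketched here with a hypothetical `load_csv` helper, is to check for the file before reading it.

```python
from pathlib import Path

import pandas as pd


def load_csv(file_name):
    """Read a CSV, raising a descriptive error if the file is missing."""
    path = Path(file_name)
    if not path.is_file():
        raise FileNotFoundError(
            f"Could not find '{file_name}' in {Path.cwd()}. "
            "Check the spelling and that the file sits next to the notebook."
        )
    return pd.read_csv(path)


try:
    df = load_csv("snyder_cut.csv")
except FileNotFoundError as error:
    # Report the problem instead of stopping with a long traceback
    print(error)
```

Because later cells depend on `df`, a failed import here also explains any `NameError` in the stages that follow.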

Stage Two: Display a Summary and Sub-sections of the Data
Note that, depending on its contents and size, your dataset may raise errors on import.

This stage displays some summary information, a random sample, and the ten most common usernames in the dataset so we can get a sense of the contents.

In [4]:
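# Summary statistics for every column, numeric or not
print(df.describe(include='all'))

# Ten randomly chosen rows
print(df.sample(10))

# The ten most frequent usernames
print(df['from_user'].value_counts()[:10])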

NameError: name 'df' is not defined
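
If a large file causes memory errors on import, one option is to read it in pieces. The sketch below assumes a `from_user` column as in the cell above and uses inline sample data in place of a real file; the tiny `chunksize` is only for illustration.

```python
import io
from collections import Counter

import pandas as pd

# Hypothetical sample standing in for a large CSV file
sample_csv = io.StringIO(
    "from_user,text\n"
    "alice,first\n"
    "bob,second\n"
    "alice,third\n"
    "carol,fourth\n"
    "alice,fifth\n"
)

user_counts = Counter()
# chunksize makes read_csv yield DataFrames of at most that many rows
for chunk in pd.read_csv(sample_csv, usecols=["from_user"], chunksize=2):
    user_counts.update(chunk["from_user"].value_counts().to_dict())

# The ten most common usernames across all chunks
top_users = user_counts.most_common(10)
print(top_users)
```

Reading only the columns you need with `usecols` reduces memory use further.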

Stage Three: Clean Your Data
How you clean Twitter data depends on your goals. For this example, we first delete several columns that aren't useful for our analysis; these will vary by project, so comment out any deletions below that don't apply. We then remove duplicate tweets by text (to avoid counting excessive retweets) and replace empty values with a placeholder.



In [None]:
# The first section deletes the unwanted columns
del df['geo_coordinates']
del df['user_lang']
del df['in_reply_to_screen_name']
del df['in_reply_to_status_id_str']
del df['in_reply_to_user_id_str']
del df['from_user_id_str']
del df['source']
del df['profile_image_url']
del df['status_url']
del df['entities_str']
del df['id_str']

In [None]:
# The second section drops duplicates and replaces empty text

df = df.drop_duplicates('text')
df = df.fillna('Not available')  # fillna returns a new DataFrame, so assign it back
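
Each `del` statement above raises a `KeyError` if its column is absent, which is likely when adapting the notebook to a different dataset. As an alternative sketch, `DataFrame.drop` with `errors='ignore'` skips missing columns; the small DataFrame here is invented for illustration.

```python
import pandas as pd

# Invented sample data; a real export would have many more columns
df = pd.DataFrame(
    {
        "from_user": ["alice", "bob", "carol"],
        "text": ["RT hello", "RT hello", None],
        "source": ["web", "app", "web"],
    }
)

unwanted = ["geo_coordinates", "user_lang", "source"]

# errors='ignore' skips any listed column that is not in the DataFrame
df = df.drop(columns=unwanted, errors="ignore")

# Both methods return new DataFrames, so the results are assigned back
df = df.drop_duplicates("text")
df = df.fillna("Not available")
print(df)
```

Keeping the unwanted column names in one list also makes it easy to see, and edit, what is being removed.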

Stage Four: Plot Your Data
This notebook includes three visualizations that should work with any dataset.

To handle larger datasets with multiple days of data, the first example plots tweet volume over time, after first truncating each timestamp to the hour (removing minutes and seconds).

The second example uses a regular expression to extract any element with the structure "#text" from the text of each tweet. We can then plot the counts of the most common hashtags; skipping the hashtag at position 0 is an easy way to remove the original search query.

The final example uses a pie chart (which is rarely as useful) to demonstrate relative activity of users. This is a good quick way to see how much a particular conversation relies upon a few participants or amplifiers.



In [None]:
# First Visualization: Data Over Time (Simplified to Hour)
 
df['tweet_hour'] = df['time']
df['tweet_hour'] = df['tweet_hour'].str.slice(0, 13)

print(df['tweet_hour'][:5])

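# Sort by the hour label rather than by count; this is chronological only if the timestamp text sorts in date order
df['tweet_hour'].value_counts().sort_index().plot(kind='line', title='Tweet Volume by Hour')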


In [None]:
<AxesSubplot:title={'center':'Tweet Volume by Hour'}>
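
Slicing the timestamp as text works when every value shares one fixed layout. A possible alternative, sketched below, is to parse the column into real datetimes and round down to the hour, which also keeps the hours in chronological order. The sample timestamps and their format are assumptions, not taken from the dataset.

```python
import pandas as pd

# Invented timestamps; the real 'time' column may use a different format
df = pd.DataFrame(
    {
        "time": [
            "2021-03-18 10:05:00",
            "2021-03-18 10:45:30",
            "2021-03-18 11:15:00",
            "2021-03-19 09:00:10",
        ]
    }
)

# Parse the text into datetimes, then round each one down to the hour
df["tweet_hour"] = pd.to_datetime(df["time"]).dt.floor("h")

# Count tweets per hour and sort by time rather than by count
hourly = df["tweet_hour"].value_counts().sort_index()
print(hourly)

# hourly.plot(kind='line', title='Tweet Volume by Hour') would draw the chart
```

If your timestamps are written day-first or in another layout, `pd.to_datetime` may need its `format` or `dayfirst` argument to parse them correctly.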

In [None]:
# Second Visualization : Hashtags in the Data

hashtags = df['text'].str.extractall(r'(\#\w+)')[0].value_counts()

hashtags[1:10].plot(kind='barh', title='Top Hashtags')

In [None]:
<AxesSubplot:title={'center':'Top Hashtags'}>
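
Hashtags that differ only in capitalization are counted separately by the cell above. A small variation, shown here as a sketch on invented tweets, lowercases each match before counting.

```python
import pandas as pd

# Invented tweets for illustration
text = pd.Series(
    [
        "#SnyderCut is finally out #DC",
        "Watching #snydercut tonight",
        "No tags here",
        "#DC #Batman",
    ]
)

# extractall returns one row per match; column 0 holds the captured hashtag
matches = text.str.extractall(r"(#\w+)")[0]

# Lowercase before counting so '#SnyderCut' and '#snydercut' are merged
hashtags = matches.str.lower().value_counts()
print(hashtags)
```

Whether to merge capitalizations is a judgment call: for some analyses the variation itself is interesting.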

In [None]:
# Third Visualization: Users

# Count tweets per user across the whole dataset, then keep the 30 most active
df['from_user'].value_counts()[:30].plot(kind='pie', title='Top Users')

In [None]:
<AxesSubplot:title={'center':'Top Users'}, ylabel='from_user'>
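
A pie chart becomes hard to read with many slices. One common approach, sketched here on invented usernames with a hypothetical cutoff `top_n`, keeps the most active users and groups everyone else into a single "Other" slice.

```python
import pandas as pd

# Invented usernames for illustration
users = pd.Series(["alice", "alice", "alice", "bob", "bob", "carol", "dave"])

top_n = 2  # hypothetical cutoff; adjust to taste
counts = users.value_counts()

# Keep the most active users and sum the remainder into one slice
top = counts.iloc[:top_n]
other = pd.Series({"Other": counts.iloc[top_n:].sum()})
pie_data = pd.concat([top, other])
print(pie_data)

# pie_data.plot(kind='pie', title='Top Users') would draw the chart
```

The size of the "Other" slice relative to the named users gives a quick sense of how concentrated the conversation is.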


Stage Five: Draw Comparisons and Make Claims
While I won't analyze this dataset here, try borrowing these methods and playing with the parameters to see what you can find!



Bonus: Export a Meaningful Visualization
This requires two tools: first, we have to grab the figure itself using get_figure from whichever visualization has been most successful.

Next, we can use savefig to write the figure to a range of file types. Note that long labels can get cut off unless you pass bbox_inches='tight' as below; the dpi parameter sets the resolution of the saved image.

In [None]:
fig = hashtags[1:10].plot(kind='barh', title='Top Hashtags').get_figure()
fig.savefig('hashtags.png', dpi=300, bbox_inches='tight')