# Word Parsing Module
### Overview

<div class="logos"><img src="../Resources/media_assets/Twitter_Logo_Blue.png" width="200x" align="right"></div> 
<p>
Big data is a rapidly growing field: the volume of available data grows with our ability to collect it. With widespread internet connectivity, data can be collected and distributed by private industry, gathered from instruments and sensors, and produced by digitizing historical records. Processing this data can advance fields such as social science, physical science, health, and business.
<br><br>
In this module you will use data from the social media platform Twitter. The dataset is too large to analyze by hand, so you will use Python code to aggregate, process, and present the data in a form that people can readily understand.

</p>

### Agenda
- Overview and applications
- Import libraries
- Import data into a Pandas dataframe
- Sort data
- Count data
- Plot data

## Libraries Used
- **Pandas** - A Python data analysis library 
- **Matplotlib** - A Python library for creating visualizations
<!--- Numpy -->

## Dataset
We will use data extracted from Twitter. The data spans December 2019 to February 2020 and consists of tweets concerning COVID-19.

In [None]:
#Section 1 - Find the top 10 tweets from the dataset 

#import packages
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## View the Pandas methods available. We will use read_csv.
#---(Command to view Pandas methods)

## Import data & create a Pandas dataframe object named df
#---df=pd.read_csv('path_to_file/filename.csv') 


## Print the attributes and methods of df. We will be using sort_values and head.
#---(Command to view dataframe object attributes and methods)

## View the headers of the dataframe to see what is in it
#---(Command to view headers of dataframe)

## View/create a Pandas series 
#---print(df['Name of Header'])

## Sort the dataframe
#--- df = df.sort_values(by=['Column to sort by'], ascending=False)

## Create another series based on the sort you just completed
#--- name_your_series= df['Your Series']     

## Print your Pandas series out
#---(Command to print series)

## Questions:
What are the top 5 most retweeted tweets in the dataset?<br>
Which tweet was viewed the most (received the most impressions)?
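
The Section 1 workflow (load, inspect headers, sort, take the top rows) can be sketched on a tiny stand-in dataset. The column names `Tweet Content`, `Retweets Received`, and `Impressions` are assumptions made for illustration; check `df.columns` on the real file for the actual headers.

```python
import io
import pandas as pd

# Hypothetical miniature CSV; the real dataset's headers may differ.
csv_text = """Tweet Content,Retweets Received,Impressions
First tweet,12,300
Second tweet,45,150
Third tweet,7,900
"""
df = pd.read_csv(io.StringIO(csv_text))

# View the headers of the dataframe
print(df.columns.tolist())

# Sort so the most retweeted rows come first, then keep the top 2
by_retweets = df.sort_values(by=["Retweets Received"], ascending=False)
top_retweeted = by_retweets["Tweet Content"].head(2)
print(top_retweeted)

# Find the tweet with the largest impression count
most_viewed = df.loc[df["Impressions"].idxmax(), "Tweet Content"]
print(most_viewed)
```

On the full dataset, the same pattern with `head(5)` would give the five most retweeted entries.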


In [None]:
# Section 2 - Categorize entries as Tweet, Retweets, or Replies

#Step 1) Import Libraries     - as done before
#Step 2) Import Data          - as done before
#Step 3) Group dataframe      - see below
#Step 4) Print total elements - see below

#Step 3) Create a grouped dataframe
#--- name_of_dataframe_group = df.groupby(['Series to Group By'])

#Step 4) Print number of elements of each group
#--- apply .size() method to group object



## Questions: 
Did the dataset consist mostly of retweets, replies, or original tweets?<br>
Which language was used the most in the dataset? (English, Spanish, French, etc.)<br>
Were most tweets from verified or non-verified accounts?<br>
What operating system was used the most?<br>
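
As a rough illustration of Steps 3 and 4 in Section 2, the sketch below groups a small made-up dataframe and counts the rows in each group. The column names `Tweet Type` and `Tweet Language` are hypothetical placeholders, not confirmed headers from the dataset.

```python
import pandas as pd

# Made-up sample rows; column names are assumptions for illustration.
df = pd.DataFrame({
    "Tweet Type": ["Tweet", "Retweet", "Retweet", "Reply", "Retweet"],
    "Tweet Language": ["English", "English", "Spanish", "English", "French"],
})

# Group by a column and count how many rows fall into each group
type_counts = df.groupby(["Tweet Type"]).size()
print(type_counts)

# value_counts() is a shortcut that counts and sorts in one call
language_counts = df["Tweet Language"].value_counts()
most_common_language = language_counts.idxmax()
print(most_common_language)
```

The same two lines, pointed at a different column, would answer each of the questions above.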

In [None]:
# Section 3 - Count the number of hashtags and plot
#Step 1) Import libraries                 -as done before
#Step 2) Import Data                      -as done before
#Step 3) Create Series of Tweet content   -as done before
#Step 4) Parse content for "#"            -see below 
#Step 5) Select the top 10                -see below
#Step 6) Plot                             -see below

#Step 4) Parse content for hashtag
#We will use regular expressions to parse through tweet content
#--- apply .str.findall(r'string').explode() to series

#Step 5) Get top 10 
#--- top_10_hashtags=hashtags.value_counts().head(10)

#Step 6) Plot using plt (matplotlib.pyplot)

## Question 
Is there enough information for you to confidently tell whether shorter or longer hashtags are used more?
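
One possible way to carry out Steps 4 to 6 of Section 3 is sketched below on a handful of invented tweets. The pattern `#\w+` is an assumption about what counts as a hashtag (a `#` followed by letters, digits, or underscores); the real data may call for a different pattern.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Invented tweet text standing in for the tweet content series
content = pd.Series([
    "Stay safe #COVID19 #coronavirus",
    "New update #COVID19",
    "No hashtags here",
    "#coronavirus #COVID19 #health",
])

# findall returns a list of matches per tweet; explode gives one row per match.
# Tweets without hashtags become NaN after explode, so drop them.
hashtags = content.str.findall(r"#\w+").explode().dropna()

# Count occurrences and keep the 10 most frequent
top_10_hashtags = hashtags.value_counts().head(10)
print(top_10_hashtags)

# Bar chart of the counts
fig, ax = plt.subplots()
ax.bar(top_10_hashtags.index, top_10_hashtags.values)
ax.set_xlabel("Hashtag")
ax.set_ylabel("Count")
ax.tick_params(axis="x", rotation=45)
fig.tight_layout()
```

In a notebook the figure is displayed when the cell finishes; in a plain script, `plt.show()` or `fig.savefig(...)` would be needed.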

In [None]:
#Section 4 - Grouping by month

#Step 1) Import libraries             - as done before 
#Step 2) Import Data                  - as done before 
#Step 3) UTC Time Series              - as done before 
#Step 4) Convert to year and month    - see below
#Step 5) Group & Sum appropriately    - see below
#Step 6) Plot                         - as done before


# Step 4) Append month series to dataframe
#--- df['Month'] = pd.to_datetime(df['Tweet Posted Time (UTC)']).dt.strftime('%Y-%m')

# Step 5) Sum up all the views in each month
#--- month_totals = df.groupby(['Month']).sum(numeric_only=True)

# Step 6) Plot using plt (matplotlib.pyplot)
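
The month grouping can be sketched on a few invented rows as follows. The `Impressions` column name and the timestamp format are assumptions; only `Tweet Posted Time (UTC)` comes from the template above.

```python
import pandas as pd

# Invented rows; the timestamp format in the real file may differ.
df = pd.DataFrame({
    "Tweet Posted Time (UTC)": [
        "2019-12-30 23:10:00",
        "2020-01-02 08:00:00",
        "2020-01-15 14:30:00",
        "2020-02-01 00:05:00",
    ],
    "Impressions": [100, 250, 50, 400],
})

# Reduce each timestamp to a 'YYYY-MM' label
df["Month"] = pd.to_datetime(df["Tweet Posted Time (UTC)"]).dt.strftime("%Y-%m")

# Sum the impressions within each month
month_totals = df.groupby(["Month"])["Impressions"].sum()
print(month_totals)
```

Selecting the `Impressions` column before calling `.sum()` keeps text columns out of the aggregation.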


In [None]:
#Section 5 - Grouping by hour

#Step 1) Import libraries             - as done before 
#Step 2) Import Data                  - as done before 
#Step 3) UTC-Time Series              - as done before 
#Step 4) Create Hour of Day Series    - see below
#Step 5) Group & Sum appropriately    - as done before
#Step 6) Plot                         - as done before

# Step 4) Add a new hour column
#--- df['Hour'] = pd.to_datetime(df['Tweet Posted Time (UTC)']).dt.strftime('%H')
# Eastern Standard Time is UTC-5 (UTC-4 applies only during daylight saving time)
#--- df['Hour_EST'] = (pd.to_numeric(df['Hour']) - 5) % 24

# Step 5) Sum up all the views in each hour
#--- hour_totals = df.groupby(['Hour_EST']).sum(numeric_only=True)

# Step 6) Plot using plt (matplotlib.pyplot)



## Question
How can you interpret the data?<br>
Why do you think it has this shape?
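
A small sketch of the hour-of-day grouping is given below, assuming Eastern Standard Time is a fixed UTC-5 offset (reasonable for December through February, when daylight saving time is not in effect). The `Impressions` column name and the sample rows are invented for illustration.

```python
import pandas as pd

# Invented rows; 'Impressions' is a hypothetical column name.
df = pd.DataFrame({
    "Tweet Posted Time (UTC)": [
        "2020-01-02 03:15:00",
        "2020-01-02 14:00:00",
        "2020-01-03 14:45:00",
    ],
    "Impressions": [10, 20, 30],
})

# Extract the UTC hour (0-23) from each timestamp
hour_utc = pd.to_datetime(df["Tweet Posted Time (UTC)"]).dt.hour

# Shift to EST; the modulo wraps early-morning UTC hours back to the previous evening
df["Hour_EST"] = (hour_utc - 5) % 24

# Sum impressions for each hour of the day
hour_totals = df.groupby(["Hour_EST"])["Impressions"].sum()
print(hour_totals)
```

Here 03:00 UTC maps to hour 22 of the previous evening, which is why the modulo is needed.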