# Project: Wrangle and Analyze Data


## About The Project
The dataset I will be wrangling, analyzing, and visualizing is the tweet archive of the Twitter user @dog_rates, also known as WeRateDogs. WeRateDogs is a Twitter account that rates people's dogs with a humorous comment about the dog. These ratings almost always have a denominator of 10. The numerators, though? Almost always greater than 10. 11/10, 12/10, 13/10, etc. Why? Because "they're good dogs Brent." WeRateDogs has over 4 million followers and has received international media coverage.

WeRateDogs downloaded their Twitter archive and sent it to Udacity via email exclusively for Udacity students to use in their project. This archive contains basic tweet data (tweet ID, timestamp, text, etc.) for all 5000+ of their tweets as they stood on August 1, 2017.

My project consists of the following major steps, with sub-tasks listed where they add detail:

Step 1: Gathering data

Step 2: Assessing data

Step 3: Cleaning data

Step 4: Storing data

Step 5: Analyzing and visualizing data

Step 6: Reporting on

* my data wrangling efforts
* my data analyses and visualizations


I will be importing the following modules for my project:

1. pandas
2. NumPy
3. requests
4. tweepy
5. json



### My Project's Dataset

I will be working with three (3) datasets. Below is a detailed description of each dataset I will be working with throughout the project. From these datasets I will derive insights about WeRateDogs tweets.

***1. Enhanced Twitter Archive***

The WeRateDogs Twitter archive contains basic tweet data for all 5000+ of their tweets, but not everything. One column the archive does contain though: each tweet's text, which I used to extract rating, dog name, and dog "stage" (i.e. doggo, floofer, pupper, and puppo) to make this Twitter archive "enhanced." Of the 5000+ tweets, I have filtered for tweets with ratings only (there are 2356).
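
To illustrate how a rating could be pulled out of a tweet's text, here is a minimal sketch using a regular expression. The sample tweets and the helper name `extract_rating` are made up for illustration. The extraction actually used to build the enhanced archive is not shown in this notebook and may differ.

```python
import re

# Matches "<numerator>/<denominator>", allowing a decimal numerator such as 9.75/10
RATING_PATTERN = re.compile(r'(\d+(?:\.\d+)?)/(\d+)')

def extract_rating(text):
    """Return (numerator, denominator) from the first rating found, or None."""
    match = RATING_PATTERN.search(text)
    if match is None:
        return None
    return float(match.group(1)), int(match.group(2))

# Hypothetical tweets, written only for this example
sample_tweets = [
    "This is Rex. He is a very good pupper. 13/10",
    "Meet Luna. She loves snow. 9.75/10 would pet",
    "No rating in this one",
]

ratings = [extract_rating(t) for t in sample_tweets]
print(ratings)
```

Taking only the first match is a simplification: a tweet that mentions a date or a fraction before its rating would be misread, which is one reason the ratings deserve a closer look during assessment.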


***2. Additional Data via the Twitter API***

Back to the basic-ness of Twitter archives: retweet count and favorite count are two of the notable column omissions. Fortunately, this additional data can be gathered by anyone from Twitter's API. Well, "anyone" who has access to data for the 3000 most recent tweets, at least. 


***3. Image Predictions File***

One more cool thing: I ran every image in the WeRateDogs Twitter archive through a neural network that can classify breeds of dogs. The results: a table full of image predictions (the top three only) alongside each tweet ID, image URL, and the image number that corresponded to the most confident prediction (numbered 1 to 4 since tweets can have up to four images).
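
As a rough illustration of how the top-three predictions could be reduced to a single breed per tweet, the sketch below picks the highest-ranked prediction that is flagged as a dog. The rows are invented, and the `p1_conf`/`p1_dog`-style column names and the helper `best_dog_prediction` are assumptions made for this example rather than something confirmed by the text above.

```python
import pandas as pd

# Hypothetical rows; the p*_conf and p*_dog column names are assumptions
predictions = pd.DataFrame({
    'tweet_id': ['1', '2'],
    'p1': ['golden_retriever', 'paper_towel'],
    'p1_conf': [0.85, 0.60],
    'p1_dog': [True, False],
    'p2': ['Labrador_retriever', 'pug'],
    'p2_conf': [0.10, 0.30],
    'p2_dog': [True, True],
    'p3': ['tennis_ball', 'toilet_tissue'],
    'p3_conf': [0.02, 0.05],
    'p3_dog': [False, False],
})

def best_dog_prediction(row):
    """Return the highest-ranked prediction that is flagged as a dog breed."""
    for rank in (1, 2, 3):
        if row[f'p{rank}_dog']:
            return row[f'p{rank}']
    return None

predictions['breed'] = predictions.apply(best_dog_prediction, axis=1)
print(predictions[['tweet_id', 'breed']])
```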

In [2]:
#Loading modules

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import sklearn
import time
import datetime
import requests
import json
import tweepy
import io
import os

## Data Gathering
In the cells below, gather **all** three pieces of data for this project and load them in the notebook. **Note:** the method required to gather each dataset is different.
1. Directly download the WeRateDogs Twitter archive data (twitter-archive-enhanced.csv)

In [None]:
#Reading The Enhanced Twitter Archive csv

twitter_archive = pd.read_csv('twitter-archive-enhanced.csv')


In [None]:
#taking a look at the uploaded data

twitter_archive

2. Use the Requests library to download the tweet image predictions (image-predictions.tsv)

In [None]:
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
response = requests.get(url)
# Stop early if the download failed instead of saving an error page
response.raise_for_status()

# Save the response content to a TSV file
with open('image-predictions.tsv', mode='wb') as file:
    file.write(response.content)

# Read the TSV file (tab-separated)
image_prediction = pd.read_csv('image-predictions.tsv', sep='\t')

In [None]:
# Looking at the information in our data
image_prediction

3. Use the Tweepy library to query additional data via the Twitter API (tweet_json.txt)

In [None]:
# Query the Twitter API for each tweet in the Twitter archive and save the JSON in a text file
# The API keys below are hidden to comply with Twitter's API terms and conditions

# Create the API object used to gather the Twitter data (tweepy is imported above)

consumer_key = 'Secret'
consumer_secret = 'Secret'
access_token = 'Secret'
access_secret = 'Secret'

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth, wait_on_rate_limit=True)

In [None]:
# Extract the IDs from twitter_archive
twitter_archive.tweet_id

In [None]:
# Express the tweet IDs as a list
list(twitter_archive.tweet_id)

In [None]:
# Check the total numbers of tweet IDs
len(list(twitter_archive.tweet_id))

In [None]:
# Using one tweet ID as an example: Get the status of one tweet ID
page = api.get_status(891815181378084864, tweet_mode='extended')

In [None]:
page

In [None]:
# Tweet IDs for which to gather additional data via Twitter's API
tweet_ids = twitter_archive.tweet_id.values
tweet_ids

In [None]:
len(tweet_ids)

In [None]:
# Query Twitter's API for the JSON data of each tweet ID in the Twitter archive
index = 0
# dictionary to catch the errors
error_dict = {}
start = time.time()

# Save each tweet's returned JSON as a new line in a .txt file
with open ('tweet_json.txt', 'w') as tweet_bk:
    # This will likely take 20 - 30 minutes to run because of Twitter's rate limit
    for tweet_id in tweet_ids:
        index += 1
        try:
            # Get the status data for each of the tweet IDs
            tweet = api.get_status(tweet_id, tweet_mode = 'extended')
            print(str(index) + ": " + "ID - " + str(tweet_id))
            # Convert each tweet status to JSON string and save it in the tweet_bk file
            json.dump(tweet._json, tweet_bk)
            # recognize \n as a break of text
            tweet_bk.write("\n")
            
        # Catch errors raised while requesting the tweet (e.g. a deleted or protected tweet)
        except tweepy.TweepyException as error:
            # Print the exception itself: only HTTP errors carry a .response attribute
            print(str(index) + ": " + "ID - " + str(tweet_id) + " has an error:", error)
            # Store the error in error_dict, keyed by tweet ID
            error_dict[tweet_id] = error
            
end = time.time()
print(end - start)
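
After a long query loop like the one above, it can be useful to confirm which tweet IDs never made it into the output file. The sketch below shows one way to do that with a set difference. The IDs, the temporary file, and the variable names are stand-ins invented for this example, not the notebook's real data.

```python
import json
import os
import tempfile

# Hypothetical stand-ins for twitter_archive.tweet_id and tweet_json.txt
requested_ids = [101, 102, 103]
path = os.path.join(tempfile.mkdtemp(), 'tweet_json_sample.txt')
with open(path, 'w') as f:
    for tweet_id in (101, 103):
        json.dump({'id': tweet_id, 'retweet_count': 1, 'favorite_count': 2}, f)
        f.write('\n')

# Collect the IDs that were actually written, one JSON object per line
with open(path) as f:
    saved_ids = {json.loads(line)['id'] for line in f}

# Anything requested but not saved was skipped or failed
missing_ids = sorted(set(requested_ids) - saved_ids)
print(missing_ids)
```

In the notebook itself, the same comparison could be made against the keys of `error_dict`, which should list the same IDs if every failure was caught.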

In [None]:
# Extract the columns missing from the enhanced Twitter archive

# Empty list to convert to DataFrame
df_list = []

# Open text file for reading
with open('tweet_json.txt', 'r') as json_file:
    for line in json_file:
        # Parse each line's JSON string into a dictionary
        each_tweet_line = json.loads(line)
        # Get the required fields
        tweet_id = each_tweet_line['id']
        retwt_count = each_tweet_line['retweet_count']
        fav_count = each_tweet_line['favorite_count']
        follows_count = each_tweet_line['user']['followers_count']
        frnds_count = each_tweet_line['user']['friends_count']
        
        df_list.append({'id': tweet_id,
                       'retweet_count': retwt_count,
                       'favorite_count': fav_count,
                       'followers_count': follows_count,
                       'friends_count': frnds_count})
        
tweet_json = pd.DataFrame(df_list, columns=['id', 'retweet_count', 'favorite_count', 'followers_count', 'friends_count'])
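
As an alternative to parsing the file line by line, pandas can read a file of one JSON object per line directly. The sketch below uses two invented records held in memory in place of `tweet_json.txt`. It is a simplified illustration, not a drop-in replacement for the cell above.

```python
import io
import pandas as pd

# Two hypothetical lines in the same shape as tweet_json.txt (one JSON object per line)
sample = io.StringIO(
    '{"id": 101, "retweet_count": 5, "favorite_count": 20}\n'
    '{"id": 102, "retweet_count": 7, "favorite_count": 31}\n'
)

# lines=True tells pandas to parse each line as a separate JSON record
tweets = pd.read_json(sample, lines=True)
tweets = tweets[['id', 'retweet_count', 'favorite_count']]
print(tweets)
```

Nested fields such as the follower count under `user` would arrive as a column of dictionaries with this approach, so the explicit loop is still the simpler route when nested values are needed.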


In [None]:
df_list

In [None]:
tweet_json

## Assessing Data
In this section, detect and document at least **eight (8) quality issues and two (2) tidiness issues**. You must use **both** visual assessment and programmatic assessment to assess the data.

**Note:** pay attention to the following key points when you assess the data.

* You only want original ratings (no retweets) that have images. Though there are 5000+ tweets in the dataset, not all are dog ratings and some are retweets.
* Assessing and cleaning the entire dataset completely would require a lot of time, and is not necessary to practice and demonstrate your skills in data wrangling. Therefore, the requirements of this project are only to assess and clean at least 8 quality issues and at least 2 tidiness issues in this dataset.
* The fact that the rating numerators are greater than the denominators does not need to be cleaned. This [unique rating system](http://knowyourmeme.com/memes/theyre-good-dogs-brent) is a big part of the popularity of WeRateDogs.
* You do not need to gather the tweets beyond August 1st, 2017. You can, but note that you won't be able to gather the image predictions for these tweets since you don't have access to the algorithm used.
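
The first point above (original ratings only, with images) can be expressed as two boolean filters. The sketch below uses tiny invented tables. The name `originals_with_images` is hypothetical, and the image table is assumed to share a `tweet_id` column with the archive.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature versions of the archive and the image predictions
archive = pd.DataFrame({
    'tweet_id': [1, 2, 3, 4],
    'retweeted_status_id': [np.nan, 999.0, np.nan, np.nan],
})
images = pd.DataFrame({'tweet_id': [1, 2, 4]})

# Original tweets have no retweeted_status_id
is_original = archive['retweeted_status_id'].isnull()
# Tweets with images appear in the image predictions table
has_image = archive['tweet_id'].isin(images['tweet_id'])

originals_with_images = archive[is_original & has_image]
print(originals_with_images['tweet_id'].tolist())
```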



### Visual Assessment

Each piece of gathered data is displayed in the Jupyter Notebook for visual assessment purposes.

In [None]:
df1 = twitter_archive

In [None]:
# Taking a look at the information in our Enhanced_Twit_arc data
df1

In [None]:
df2 = image_prediction

In [None]:
# Taking a look at the information in our image_prediction data
df2

In [None]:
df3 = tweet_json

In [None]:
# Looking at the information in our tweet_json data
df3


### Programmatic Assessment

Using pandas functions and methods to assess each gathered dataset.

In [None]:
# Looking at the concise summary of our Enhanced_Twit_arc data
df1.info()

In [None]:
# Looking at the concise summary of our image_prediction data
df2.info()

In [None]:
# Looking at the concise summary of our tweet_json data
df3.info()

In [None]:
# Looking at the statistical description of our Enhanced_Twit_arc data
df1.describe()

In [None]:
# Looking at the statistical description of our image_prediction data
df2.describe()

In [None]:
# Looking at the statistical description of our tweet_json data
df3.describe()

In [None]:
# Looking at a few rows of our Enhanced_Twit_arc data to check out relevant issues
df1.sample(5)

In [None]:
# looking at a few rows of our image_prediction data to check out relevant issues
df2.sample(5)

In [None]:
# looking at a few rows of our tweet_json data to check out relevant issues
df3.sample(5)

In [None]:
# Get the number of rows and columns in image_prediction dataframe
df2.shape

In [None]:
# Get the number of rows and columns in tweet_json dataframe
df3.shape

In [None]:
# Get the number of missing values in our Enhanced_Twit_arc data
df1.isnull().sum()

In [None]:
# Get the number of missing values in our image_prediction data
df2.isnull().sum()

In [None]:
# Get the number of missing values in our tweet_json data
df3.isnull().sum()

In [None]:
# Get the number of unique values in the columns in our twitter_archive data
df1.nunique()

In [None]:
# Get the number of unique values in the columns in our image_prediction data
df2.nunique()

In [None]:
# Get the number of unique values in the columns in our tweet_json data
df3.nunique()

In [None]:
# Get the number of duplicate rows in our Enhanced_Twit_arc data
df1.duplicated().sum()

In [None]:
# Get the number of duplicate rows in our image_prediction data
df2.duplicated().sum()

In [None]:
# Get the number of duplicate rows in our tweet_json data
df3.duplicated().sum()

In [None]:
# Getting the name of the columns
df1.columns

In [None]:
# Checking for IDs with values in the retweeted_status_id, retweeted_status_user_id, and
# retweeted_status_timestamp columns
df1[df1['retweeted_status_id'].notnull()]

In [None]:
df1.rating_numerator.value_counts()

In [None]:
df1.rating_denominator.value_counts()

### Quality issues

**Enhanced_Twit_arc (df1)**

* Some tweet IDs have values in the retweeted_status_id, retweeted_status_user_id, and retweeted_status_timestamp columns. These rows are retweets and won't be used in our analysis.
* The retweeted_status_id, retweeted_status_user_id, and retweeted_status_timestamp columns contain mostly missing values.
* The in_reply_to_status_id and in_reply_to_user_id columns contain mostly missing values.
* The expanded_urls column has missing values.
* The timestamp column is stored as a string (object) instead of a datetime data type.
* The tweet_id column is stored as int instead of a string data type.
* Some dog names in the name column are invalid (e.g. None, a, an).


**Image_predictions (df2)**

* The tweet ID column is stored as int instead of a string data type.
* Values in columns 'p1', 'p2', and 'p3' don't have a consistent format.


**tweet_json (df3)**

* Erroneous data type (the tweet ID column, id, is stored as int instead of string).
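
The data type issues listed above could be addressed along these lines. This is a minimal sketch on invented rows: the IDs and timestamps are placeholders, and the timestamp text format is assumed to resemble `YYYY-MM-DD HH:MM:SS +0000`.

```python
import pandas as pd

# Hypothetical rows; the IDs and timestamps are placeholders
df = pd.DataFrame({
    'tweet_id': [100000000000000001, 100000000000000002],
    'timestamp': ['2017-08-01 16:23:56 +0000', '2017-08-01 00:17:27 +0000'],
})

# IDs are labels, not quantities, so store them as strings
df['tweet_id'] = df['tweet_id'].astype(str)
# Parse the timestamp text into timezone-aware datetimes
df['timestamp'] = pd.to_datetime(df['timestamp'])

print(df.dtypes)
```

Storing IDs as strings also avoids accidental arithmetic or rounding on values that only serve as keys.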

### Tidiness issues


**General**

* The column label for tweet IDs should be the same across the three datasets (for example, tweet_json uses 'id' while twitter_archive uses 'tweet_id').


**twitter_archive (df1)**

* Four columns (doggo, floofer, pupper, puppo) are categories of dog 'stage' and should be a single 'stage' column with four categories: doggo, floofer, pupper, and puppo.


**tweet_json (df3)**

* The followers_count column has only 24 unique values and the friends_count column has only 1 unique value.
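
The column-label and dog-stage issues above could be handled roughly as follows. The rows are invented, the helper `combine_stages` is hypothetical, and the use of the string 'None' to mark a missing stage is an assumption about how the archive encodes it.

```python
import numpy as np
import pandas as pd

stage_cols = ['doggo', 'floofer', 'pupper', 'puppo']

# Hypothetical rows; the string 'None' marking "no stage" is an assumption
df = pd.DataFrame({
    'tweet_id': ['1', '2', '3'],
    'doggo':   ['doggo', 'None', 'None'],
    'floofer': ['None', 'None', 'None'],
    'pupper':  ['None', 'pupper', 'None'],
    'puppo':   ['None', 'None', 'None'],
})

def combine_stages(row):
    """Join every stage present in the row; NaN when none is present."""
    found = [row[c] for c in stage_cols if row[c] != 'None']
    return ', '.join(found) if found else np.nan

# Collapse the four stage columns into one and drop the originals
df['stage'] = df[stage_cols].apply(combine_stages, axis=1)
df = df.drop(columns=stage_cols)

# Give the ID column the same label in every table
tweet_json = pd.DataFrame({'id': ['1', '2', '3']}).rename(columns={'id': 'tweet_id'})
print(df)
```

Joining with a comma keeps tweets that mention more than one stage from being silently assigned to only one of them.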

## Cleaning Data
In this section, clean **all** of the issues you documented while assessing. 

**Note:** Make a copy of the original data before cleaning. Cleaning includes merging individual pieces of data according to the rules of [tidy data](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html). The result should be a high-quality and tidy master pandas DataFrame (or DataFrames, if appropriate).

In [None]:
# Make copies of original pieces of data
df1_clean = df1.copy()
df2_clean = df2.copy()
df3_clean = df3.copy()

### Issue #1: Retweets

#### Define

Remove the rows that are retweets, i.e. rows with a value in retweeted_status_id, retweeted_status_user_id, or retweeted_status_timestamp. These rows won't be used in our analysis.

#### Code

In [None]:
# Drop retweeted rows
df1_clean = df1_clean[df1_clean.retweeted_status_id.isnull()]
df1_clean = df1_clean[df1_clean.retweeted_status_user_id.isnull()]
df1_clean = df1_clean[df1_clean.retweeted_status_timestamp.isnull()]

#### Test

In [None]:
# Check that the retweets have been dropped (each count should be 0)
print(df1_clean.retweeted_status_id.notnull().sum())
print(df1_clean.retweeted_status_user_id.notnull().sum())
print(df1_clean.retweeted_status_timestamp.notnull().sum())

### Issue #2:

The followers_count column has only 24 unique values and the friends_count column contains only 1 unique value.

#### Define

Drop followers_count and friends_count columns as they don't contain necessary values that would be relevant to the analysis.

#### Code

In [None]:
df3_clean.drop(['followers_count', 'friends_count'], axis=1, inplace=True)

#### Test

In [None]:
df3_clean.head()

### Issue #3:

Some dog names are invalid, e.g. None, a, an.

#### Define

Convert invalid names (None or names starting with a lowercase letter) to NaN, then extract the correct names from the text column using the word that follows "named".

#### Code

In [None]:
# Replace 'None' and names starting with a lowercase letter with NaN
df1_clean.name = df1_clean.name.replace(regex=['^[a-z]+', 'None'], value=np.nan)

# Check the number of null values in the name column after conversion
df1_clean.name.isnull().sum()

In [None]:
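# Define a function that extracts the name following the word 'named' in the text column,
# and returns NaN if the text has no 'named' or nothing follows it

def extract_name(text):
    txt_list = text.split()
    # Stop one word early so a trailing 'named' cannot raise an IndexError
    for index, word in enumerate(txt_list[:-1]):
        if word.lower() == 'named':
            # Use the position of this match and drop trailing punctuation
            return txt_list[index + 1].strip('.,!?')
    return np.nan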
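
The notebook does not show the function being applied, so here is one way it might be used to fill only the names that are still missing. The rows and the helper name `name_after_named` are invented for this self-contained example.

```python
import numpy as np
import pandas as pd

def name_after_named(text):
    """Return the word following 'named', or NaN if there is none."""
    words = text.split()
    for i, word in enumerate(words[:-1]):
        if word.lower() == 'named':
            return words[i + 1].strip('.,!?')
    return np.nan

# Hypothetical rows, written only for this example
df = pd.DataFrame({
    'text': ['This is a dog named Wylie. 12/10',
             'This is Bella. 13/10',
             'Here is a very good boy. 11/10'],
    'name': [np.nan, 'Bella', np.nan],
})

# Only fill in names that are still missing; existing names are left alone
missing = df['name'].isnull()
df.loc[missing, 'name'] = df.loc[missing, 'text'].apply(name_after_named)
print(df['name'].tolist())
```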
        

## Storing Data
Save gathered, assessed, and cleaned master dataset to a CSV file named "twitter_archive_master.csv".
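
The `data` DataFrame saved in the next cell is the master dataset, but the cells that assemble it are not shown here. One plausible way to build it, assuming the three cleaned tables share a `tweet_id` column, is a pair of inner merges. The tables and the name `master` below are invented for illustration.

```python
import pandas as pd

# Hypothetical cleaned tables; in the notebook these would be
# df1_clean, df2_clean and df3_clean with a shared 'tweet_id' column
archive = pd.DataFrame({'tweet_id': ['1', '2', '3'], 'rating_numerator': [12, 13, 11]})
images = pd.DataFrame({'tweet_id': ['1', '2'], 'img_num': [1, 2]})
counts = pd.DataFrame({'tweet_id': ['1', '2', '3'],
                       'retweet_count': [10, 20, 30],
                       'favorite_count': [100, 200, 300]})

# Inner joins keep only tweets present in all three tables
master = (archive
          .merge(images, on='tweet_id', how='inner')
          .merge(counts, on='tweet_id', how='inner'))
print(master)
```

An inner join also enforces the "ratings with images" requirement, since tweets without an image prediction drop out of the result.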

In [None]:
# Saving the master dataset to a csv file
data.to_csv("twitter_archive_master.csv", index=False)

In [None]:
# Check if it was properly saved
data = pd.read_csv("twitter_archive_master.csv")

In [None]:
data.sample()

## Analyzing and Visualizing Data
In this section, analyze and visualize your wrangled data. You must produce at least **three (3) insights and one (1) visualization.**

In [None]:
# Looking at the description of our master dataset
data.describe()

In [None]:
# Get each dog name's share of all rows, as a percentage
data.name.value_counts() / data.shape[0] * 100

### Insights:

1. The minimum retweet count is 11, the mean is 2245, and the maximum is 70786.

2. Image number 1 is the most frequent.

### Visualization

Question 1: Does retweet count positively correlate with favourite count?

In [None]:
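# Restrict to the two columns of interest so non-numeric columns cannot break corr()
data[['retweet_count', 'favorite_count']].corr(method='pearson')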
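
When a single coefficient is easier to report than a full matrix, the correlation can be computed between two Series directly. The counts below are invented and deliberately follow an exact linear relationship, so they say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical counts, chosen so that favorite_count = 3 * retweet_count + 5
df = pd.DataFrame({
    'retweet_count': [10, 20, 30, 40],
    'favorite_count': [35, 65, 95, 125],
})

# Pearson correlation between the two columns only
r = df['retweet_count'].corr(df['favorite_count'], method='pearson')
print(round(r, 3))
```

A value close to 1 indicates a strong positive linear relationship, which is what Question 1 asks about.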

Question 2: Which image number occurred most often as each tweet's most confident image prediction?

In [None]:
# Get the value count of each image number value
data.img_num.value_counts()

In [None]:
# Use a countplot to show how often each image number corresponds
# to the most confident prediction
sns.countplot(x=data.img_num)
plt.title('The Distribution of Tweet Image Number')

Question 3: What is the most popular dog stage?

In [None]:
sns.set(style='darkgrid')
# Order the bars by frequency, keeping the three most common stages
top_stages = data['stage'].value_counts().head(3).index
sns.countplot(data=data, x='stage', order=top_stages)
plt.xlabel('Dog stages', fontsize=14)
plt.ylabel('Count', fontsize=14)
plt.title('The Distribution of Dog Stages', fontsize=16)

In [None]:
# Scatter plot with a fitted regression line for Question 1 (retweets vs. favorites)
sns.set_style('dark')
sns.regplot(x=data.retweet_count, y=data.favorite_count, scatter_kws={'color': 'Green'})

In [None]:
# Get the value count of each dog stage
data.stage.value_counts()
