## Reporting: wrangle_report


### Introduction:
This document describes the work done to gather and wrangle data from the Twitter account <a href="https://en.wikipedia.org/wiki/WeRateDogs">WeRateDogs</a>. The data came from three sources: the file ***twitter-archive-enhanced.csv***, the URL ***https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv*** (saved locally as ***tweet_image_pred.csv***) and the Twitter API (saved locally as ***tweet_json.txt***).

The file ***twitter-archive-enhanced.csv*** contains basic tweet data for 5000+ tweets affiliated with WeRateDogs, such as tweet ID, source of tweet, tweet text and dog names. Additional data about the tweets (such as retweet and favorite counts) was extracted through the Twitter API using the tweepy library. Image prediction data was downloaded from the URL above using the requests library.

### Data gathering:
1. **Downloading the WeRateDogs Twitter archive data (directly from twitter-archive-enhanced.csv)**

   The data gathering process started by importing the pandas and numpy libraries, to help with data analysis and mathematical computations. The data from twitter-archive-enhanced.csv was loaded into a dataframe **df_twitter_arc**.

2. **Use of the Requests library to download the tweet image predictions (image-predictions.tsv)**

    The image prediction data was extracted by importing the ***requests*** library and calling its ***get*** method on the URL, storing the result in the variable ***response***. A check of the response status code returned 200, confirming that the request was successful. The content was then written to the file ***tweet_image_pred.csv***, which was loaded into a dataframe, and the first few rows were inspected to confirm that data existed.

3. **Use of the Tweepy library to query additional data via the Twitter API (tweet_json.txt)**

    The additional data was extracted with the imported ***tweepy*** library, which queries the Twitter API. Since the API is not open to all users, it requires users to be authenticated, so I had to sign up for a Twitter account and set up a developer account by following the "How to apply" section of the <a href="https://developer.twitter.com/en/docs/basics/developer-portal/overview">Twitter Developer Portal</a>. Once I was approved and had received all the required credentials, I ran the following code to create the API object used to get the data:
    

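    import tweepy

    # Placeholders: substitute the credentials issued by the developer portal
    consumer_key = 'MY CONSUMER KEY'
    consumer_secret = 'MY CONSUMER SECRET'
    access_token = 'MY ACCESS TOKEN'
    access_secret = 'MY ACCESS SECRET'

    # Authenticate with OAuth, then build an API object that returns parsed
    # JSON and waits when the rate limit is reached.
    # This call uses the tweepy 3.x signature; later major versions changed
    # these keyword arguments.
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_secret)
    api = tweepy.API(auth_handler=auth,
                     parser=tweepy.parsers.JSONParser(),
                     wait_on_rate_limit=True,
                     wait_on_rate_limit_notify=True)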
                
Since Twitter returns tweets in JSON format, code was written to write the data to a file called ***tweet_json.txt*** and read it back. Selected fields were then stored in a dataframe ***additional_attrib***.
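
The download and JSON storage steps described above could look roughly like the sketch below. It is a minimal illustration, not the project code: the helper names (`download_file`, `write_tweets`, `read_tweet_attributes`) are hypothetical, and the selected fields (`id`, `retweet_count`, `favorite_count`) and resulting column names are assumptions about what was kept in ***additional_attrib***.

```python
import json

import pandas as pd
import requests

# URL quoted in the introduction of this report.
IMAGE_PREDICTIONS_URL = (
    "https://d17h27t6h515a5.cloudfront.net/topher/2017/August/"
    "599fd2ad_image-predictions/image-predictions.tsv"
)


def download_file(url, path):
    """Download `url` with requests and write the raw bytes to `path`."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # raises unless the status code signals success
    with open(path, "wb") as file:
        file.write(response.content)


def write_tweets(tweets, path):
    """Store one JSON document per line (the layout assumed for tweet_json.txt)."""
    with open(path, "w", encoding="utf-8") as file:
        for tweet in tweets:
            file.write(json.dumps(tweet) + "\n")


def read_tweet_attributes(path):
    """Read the stored tweets back and keep a few fields per tweet."""
    rows = []
    with open(path, encoding="utf-8") as file:
        for line in file:
            tweet = json.loads(line)
            rows.append({
                "tweet_id": tweet["id"],
                "retweet_count": tweet["retweet_count"],
                "favorite_count": tweet["favorite_count"],
            })
    return pd.DataFrame(
        rows, columns=["tweet_id", "retweet_count", "favorite_count"]
    )
```

Under these assumptions, `download_file(IMAGE_PREDICTIONS_URL, "tweet_image_pred.csv")` followed by `pd.read_csv("tweet_image_pred.csv", sep="\t")` would load the tab-separated predictions, and `read_tweet_attributes("tweet_json.txt")` would produce the additional attribute table.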


### How data was assessed:

1. Assessment of twitter archive data - (df_twitter_arc dataframe)
    * The info() function was used to check whether the column data types were correct, and whether there were nulls
        * issues with the data types of some columns were found, as noted in the project workspace
    * describe() was used to see which columns statistics are reported on; since it summarizes only numeric columns by default, this is another way to check the data types
    * Data quality was checked by running head() and looking at the data in each column - the csv file was also inspected directly
    * Analyzing the dog stage data (in columns doggo, floofer, pupper and puppo) was not easy, since each column name was also used as the data value, alongside 'None'. To make analysis of the dog stage data easier and to give the data a tidier tabular structure, the four columns were consolidated into one column with the previous column names as its values, using the melt function, as outlined in the 'Cleaning Data' section of the project file.

2. Assessment of the image prediction data
    * The info() function was used to check whether the column data types were correct, and whether there were nulls
        * no nulls were found
        * the data type of the tweet_id column was erroneous



3. Assessment of additional twitter data in additional_attrib dataframe:
    * The info() function was used to check whether the column data types were correct, and whether there were nulls
    * A few rows of the data were inspected using head() to judge the data quality
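
The assessment calls listed above can be illustrated on a small stand-in table. This is only a sketch: the dataframe below is made up for the example and does not contain real project data.

```python
import pandas as pd

# Toy stand-in for one of the gathered tables (values are invented).
df = pd.DataFrame({
    "tweet_id": [1001, 1002, 1003],
    "timestamp": [
        "2017-08-01 16:23:56 +0000",
        "2017-07-31 00:18:03 +0000",
        "2017-07-30 15:58:51 +0000",
    ],
    "name": ["Bo", None, "Max"],
})

df.info()                        # prints column dtypes and non-null counts
null_counts = df.isnull().sum()  # explicit null count per column
summary = df.describe()          # numeric columns only by default
print(df.head())                 # visual check of the first rows
```

Here `summary` only contains `tweet_id`, which hints that an identifier is being treated as a number, and `timestamp` is reported by `info()` as a text column rather than a datetime.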

### Data quality, tidiness issues found and cleaning done:
This section of the project outlines the issues found in the data and code used to clean the data in each table:
* Copies of the data tables were made to df_twitter_arc_clean, additional_attrib_clean and df_image_pred_clean

**<u>Cleaning df_twitter_arc_clean</u>**

**Issue #1:**
* The datatypes of the following columns were incorrect, but the columns were removed rather than converted, because they didn't provide any insight:
       in_reply_to_status_id
       in_reply_to_user_id
       retweeted_status_id
       retweeted_status_timestamp
       retweeted_status_user_id 
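
Removing these columns can be sketched as below, assuming the copy is named `df_twitter_arc_clean` as stated earlier; the single row of data is invented for the example.

```python
import pandas as pd

unused_columns = [
    "in_reply_to_status_id",
    "in_reply_to_user_id",
    "retweeted_status_id",
    "retweeted_status_timestamp",
    "retweeted_status_user_id",
]

# Toy one-row table holding the columns to drop plus two that are kept.
df_twitter_arc_clean = pd.DataFrame(
    [["1001", "Meet Bo", None, None, None, None, None]],
    columns=["tweet_id", "text"] + unused_columns,
)

df_twitter_arc_clean = df_twitter_arc_clean.drop(columns=unused_columns)
```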

**Issue #2:**
* A structural issue was discovered: doggo, floofer, pupper and puppo are values of a single variable, but were stored as column headers. The columns were converted to values with the pandas **melt function** and placed in a new column (dog_stage), so that only one of the values is affiliated with a dog at any point in time. The melt function caused rows to be duplicated, as seen in the checks done with the info() function - the row count increased from 2356 to 9424 (four rows per tweet, one for each melted column).
* A duplicate check was done on tweet_id, which confirmed that there were duplicate rows; these were subsequently removed
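
A minimal sketch of the melt and de-duplication on a toy table is shown below. The strategy for choosing which duplicate to keep (preferring a real stage over 'None') is an assumption, as are the helper column names `stage_column` and `is_none`. A tweet that carries two stages would keep only one of them under this approach.

```python
import pandas as pd

stage_columns = ["doggo", "floofer", "pupper", "puppo"]

# Toy data: the stage name doubles as the cell value, otherwise 'None'.
df = pd.DataFrame({
    "tweet_id": ["1", "2", "3"],
    "doggo": ["doggo", "None", "None"],
    "floofer": ["None", "None", "None"],
    "pupper": ["None", "pupper", "None"],
    "puppo": ["None", "None", "None"],
})

# One row per (tweet, stage column): the row count is multiplied by four.
melted = df.melt(
    id_vars="tweet_id",
    value_vars=stage_columns,
    var_name="stage_column",
    value_name="dog_stage",
)

# Sort real stages ahead of 'None', then keep one row per tweet.
melted["is_none"] = melted["dog_stage"].eq("None")
tidy = (
    melted.sort_values(["tweet_id", "is_none"])
    .drop_duplicates(subset="tweet_id", keep="first")
    .drop(columns=["stage_column", "is_none"])
    .reset_index(drop=True)
)
```

The same multiplication explains the jump reported above: 2356 rows times four melted columns gives 9424 rows before duplicates are removed.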

**Issue #3:**   
* Erroneous data in timestamp column - the trailing +0000 was removed from the end of each date time value

**Issue #4:**   
* Data type of timestamp column was string instead of datetime
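
Issues #3 and #4 can be handled together, as in the sketch below. The sample values and the exact timestamp format string are assumptions made for the example.

```python
import pandas as pd

timestamps = pd.Series([
    "2017-08-01 16:23:56 +0000",
    "2015-11-15 22:32:08 +0000",
])

# Strip the trailing UTC offset, then parse the remaining text as datetime.
stripped = timestamps.str.replace(r"\s*\+0000$", "", regex=True)
parsed = pd.to_datetime(stripped, format="%Y-%m-%d %H:%M:%S")
```

After the conversion, datetime accessors such as `parsed.dt.year` become available for analysis.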

**Issue #5:**   
* ***expanded_urls*** column has duplicate urls within a single value and needed to be split into separate columns
    * three additional columns were created and all but one were dropped, so only the first url was kept
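
One way to keep only the first url is sketched below. It assumes the urls within a value are separated by commas, and it simplifies the steps described above by splitting into temporary columns and keeping the first one directly. The urls are placeholders.

```python
import pandas as pd

df = pd.DataFrame({
    "tweet_id": ["1001", "1002"],
    "expanded_urls": [
        "https://example.com/a,https://example.com/a",
        "https://example.com/b",
    ],
})

# expand=True spreads the pieces over separate columns (0, 1, ...).
url_parts = df["expanded_urls"].str.split(",", expand=True)

# Keep only the first url and discard the remaining pieces.
df["expanded_urls"] = url_parts[0]
```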

**Issue #6:** 
* Datatype of tweet_id was changed from integer to string
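
Tweet IDs are labels rather than quantities, so storing them as strings prevents accidental arithmetic or statistics on them. A minimal sketch of the conversion, with made-up IDs, which applies equally to the tweet_id columns of the other tables:

```python
import pandas as pd

df = pd.DataFrame({"tweet_id": [100000000000000001, 100000000000000002]})

# Convert the integer identifiers to text.
df["tweet_id"] = df["tweet_id"].astype(str)
```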

**<u>Cleaning twitter additional attribute table</u>**

**Issue #7:** 
* Changed datatype of timestamp 

**Issue #8:**
* Changed datatype of ***tweet_id*** to string

**<u>Cleaning df_image_pred_clean</u>**

**Issue #9:** 
* Changed datatype of ***tweet_id*** from integer to string

**Issue #10:** 
* Data was inconsistent in ***p1, p2, p3*** of the image_predictions table - replaced underscores with spaces and tested

**Issue #11:** 
* Capitalized the first letter of the values in ***p1, p2, p3***, after the underscores had been replaced with spaces
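
Issues #10 and #11 can be sketched together as follows. The breed names are sample values, and only the first character is upper-cased so that the rest of each value is left untouched.

```python
import pandas as pd

df = pd.DataFrame({
    "p1": ["golden_retriever", "Labrador_retriever"],
    "p2": ["cocker_spaniel", "Chesapeake_Bay_retriever"],
    "p3": ["toy_poodle", "kuvasz"],
})

for column in ["p1", "p2", "p3"]:
    # Replace underscores with spaces, then upper-case the first letter.
    spaced = df[column].str.replace("_", " ", regex=False)
    df[column] = spaced.str[:1].str.upper() + spaced.str[1:]
```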

### Storing Data:
This section of the project contains the code that merged the clean data tables into one table called ***twitter_archive_master***.
* Checks for duplicates were done and duplicates were removed. From this point on, the data was analyzed to produce insights, which were turned into visualizations, as detailed in the report file act_report.
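
The merge and duplicate removal can be sketched on toy tables as below. The inner joins on `tweet_id` are an assumption about how the tables were combined, and the sample values are invented.

```python
import pandas as pd

archive = pd.DataFrame({
    "tweet_id": ["1", "2", "3"],
    "dog_stage": ["doggo", "pupper", "None"],
})
attributes = pd.DataFrame({
    "tweet_id": ["1", "2"],
    "retweet_count": [10, 20],
})
predictions = pd.DataFrame({
    "tweet_id": ["1", "2", "2"],  # tweet 2 appears twice on purpose
    "p1": ["Pug", "Chow", "Chow"],
})

# Join the three clean tables on the shared key, then drop repeated rows.
twitter_archive_master = (
    archive.merge(attributes, on="tweet_id", how="inner")
    .merge(predictions, on="tweet_id", how="inner")
    .drop_duplicates()
    .reset_index(drop=True)
)
```

Because `tweet_id` must have the same type in every table for the join keys to match, the earlier string conversions matter here. The merged table could then be written to disk with `to_csv(..., index=False)`.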