## Reporting: wrangle_report

## Introduction

Real-world data rarely comes clean. Using Python and its libraries, I'll gather data from a variety of sources and in a variety of formats, assess its quality and tidiness, and then clean it. This process is called data wrangling. I will document the wrangling efforts in a Jupyter Notebook and showcase them through analyses and visualizations using Python and its libraries.

The dataset that will be used for wrangling, analyzing and visualizing is the tweet archive of Twitter user @dog_rates, also known as WeRateDogs. WeRateDogs is a Twitter account that rates people's dogs with a humorous comment about the dog.

WeRateDogs provided their Twitter archive and sent it to Udacity via email exclusively for use in this project. This archive contains basic tweet data (tweet ID, timestamp, text, etc.) for all 5000+ of their tweets as they stood on August 1, 2017.

The goal of this project is to wrangle data from the WeRateDogs Twitter account using Python and to document the whole process in a Jupyter Notebook named wrangle_act.ipynb, i.e. this notebook. My aim is to wrangle this data so that it supports interesting and trustworthy analyses and visualizations.

## Project Details

Fully assessing and cleaning the entire dataset would require exceptional effort, so only a subset of its issues (eight quality issues and two tidiness issues at minimum) needed to be assessed and cleaned. The tasks for this project were:

 1. Data wrangling, which consists of:
    - Gathering data
    - Assessing data
    - Cleaning data
 2. Storing, analyzing, and visualizing our wrangled data
 3. Reporting on 1) our data wrangling efforts and 2) our data analyses and visualizations


## Gathering Data for this Project
### Enhanced Twitter Archive

The WeRateDogs Twitter archive is provided by Udacity. The file is downloaded manually by clicking the following link: [twitter_archive_enhanced.csv](https://d17h27t6h515a5.cloudfront.net/topher/2017/August/59a4e958_twitter-archive-enhanced/twitter-archive-enhanced.csv). Once it is downloaded, upload it and read the data into a pandas DataFrame.
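
A minimal sketch of the read step, assuming the downloaded file sits in the working directory under the name `twitter-archive-enhanced.csv`. The helper name `load_archive` and the default path are illustrative, not taken from the notebook.

```python
import pandas as pd


def load_archive(path="twitter-archive-enhanced.csv"):
    """Read the manually downloaded archive CSV into a DataFrame."""
    # Assumption: the file is a plain comma-separated file with a header row
    return pd.read_csv(path)
```

With the file in place, `df_twitter_archive_enhanced = load_archive()` would produce the dataframe that the assessment sections refer to.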

### Image Predictions File

The tweet image predictions file records what breed of dog (or other object, animal, etc.) is present in each tweet according to a neural network. This file (image_predictions.tsv) is hosted on Udacity's servers and should be downloaded programmatically using the Requests library and the following URL: https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv
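
The programmatic download could look roughly like the sketch below. The helper name `download_file` and the 30-second timeout are assumptions made for illustration; only the URL comes from the project description.

```python
import requests

IMAGE_PREDICTIONS_URL = (
    "https://d17h27t6h515a5.cloudfront.net/topher/2017/August/"
    "599fd2ad_image-predictions/image-predictions.tsv"
)


def download_file(url, path):
    """Download `url` and save the raw response body to `path`."""
    response = requests.get(url, timeout=30)  # timeout value is an assumption
    response.raise_for_status()  # stop early on an HTTP error status
    with open(path, "wb") as f:
        f.write(response.content)
    return path
```

After `download_file(IMAGE_PREDICTIONS_URL, "image_predictions.tsv")`, the file is tab-separated, so it can be read with `pd.read_csv("image_predictions.tsv", sep="\t")`.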

### Additional Data via the Twitter API
Gather **each tweet's retweet count** and **favorite ("like") count** at a minimum, plus any additional data of interest. Using the tweet IDs in the WeRateDogs Twitter archive, query the Twitter API for each tweet's JSON data using Python's Tweepy library and store each tweet's entire set of JSON data in a file called `tweet_json.txt`. Each tweet's JSON data should be written to its own line. Then read this .txt file line by line into a pandas DataFrame with (at minimum) **tweet ID**, **retweet count**, and **favorite count**.
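
The two halves of this step (write one JSON object per line, then read the file back line by line) can be sketched as follows. This is an illustration under stated assumptions, not the notebook's exact code: `api` is assumed to be an already authenticated `tweepy.API` object, the helper names are hypothetical, and a broad `except` is used because the Tweepy error class differs between library versions.

```python
import json

import pandas as pd


def save_tweets_json(api, tweet_ids, path="tweet_json.txt"):
    """Write each tweet's JSON on its own line; return the IDs that failed."""
    failed = []
    with open(path, "w", encoding="utf-8") as f:
        for tweet_id in tweet_ids:
            try:
                status = api.get_status(tweet_id, tweet_mode="extended")
            except Exception:  # assumption: error class varies by Tweepy version
                failed.append(tweet_id)  # e.g. deleted or protected tweets
                continue
            f.write(json.dumps(status._json) + "\n")
    return failed


def load_tweets_json(path="tweet_json.txt"):
    """Read the file line by line and keep the three required fields."""
    rows = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            tweet = json.loads(line)
            rows.append(
                {
                    "tweet_id": tweet["id"],
                    "retweet_count": tweet["retweet_count"],
                    "favorite_count": tweet["favorite_count"],
                }
            )
    columns = ["tweet_id", "retweet_count", "favorite_count"]
    return pd.DataFrame(rows, columns=columns)
```

Keeping the failed IDs makes it easy to see later why some archive tweets have no retweet or favorite counts.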




## Project: 

### Assessing Data

Data quality and tidiness issues are the two things to look out for while examining a dataset. Quality issues are content issues, including missing, duplicate, or erroneous data. Tidiness issues are structural problems within the data. I used both visual and programmatic assessment to check the data for quality and tidiness issues. I looked through the three dataframes carefully and documented the issues to be resolved. My findings on the data quality and tidiness issues are listed below, together with how they were cleaned.
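
As an illustration of programmatic assessment, the sketch below gathers a few common checks (column types, missing values, duplicate rows) for one dataframe. The helper name `assess` and the toy data are hypothetical; the notebook may have used different calls.

```python
import pandas as pd


def assess(df):
    """Collect a few programmatic checks for one DataFrame."""
    return {
        "dtypes": df.dtypes.astype(str).to_dict(),
        "missing": df.isnull().sum().to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
    }


# Toy data standing in for one of the three dataframes
sample = pd.DataFrame({"tweet_id": [1, 2, 2], "name": ["Phineas", None, None]})
summary = assess(sample)
```

Visual assessment complements this: scrolling through the rows, for example with `sample.head()` or in a spreadsheet, often reveals problems that summary counts hide.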

### Quality issues
#### df_twitter_archive_enhanced

1. The tweet_id data type should be changed to a string (object); this applies to all the dataframes

2. timestamp and retweeted_status_timestamp should be a datetime type and not an object (string)

3. Establish consistency in the rating_denominator column

4. Some dog names are missing, while others are incorrect, like 'a'

5. expanded_urls has missing values

6. Remove retweets by deleting rows with non-null values in retweeted_status_id column

7. Extract the source of each tweet from the source column, which is stored as a URL
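
A hedged sketch of how issues 1, 2, and 6 might be cleaned with pandas. The rows below are toy values, not real archive data, and the timestamp format is an assumption about how the archive stores dates.

```python
import pandas as pd

# Toy rows standing in for df_twitter_archive_enhanced
archive = pd.DataFrame(
    {
        "tweet_id": [1001, 1002, 1003],
        "timestamp": [
            "2017-08-01 16:23:56 +0000",
            "2017-08-01 00:17:27 +0000",
            "2017-07-31 00:18:03 +0000",
        ],
        "retweeted_status_id": [None, 999.0, None],
    }
)

clean = archive.copy()

# Issue 6: keep only original tweets (retweets have a non-null retweeted_status_id)
clean = clean[clean["retweeted_status_id"].isnull()].copy()

# Issue 1: tweet IDs are identifiers, not quantities, so store them as strings
clean["tweet_id"] = clean["tweet_id"].astype(str)

# Issue 2: parse the timestamp strings into datetimes
clean["timestamp"] = pd.to_datetime(clean["timestamp"])
```

Removing retweets first means the later conversions only touch rows that will be kept.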

#### df_image_prediction

1. P1, P2, P3 dog names are inconsistent in that some have their first letter capitalized while others do not.
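
One way to make the capitalization consistent is to lowercase all three prediction columns, as in the sketch below. The column names `p1`, `p2`, `p3` and the choice of lowercase (rather than capitalizing every name) are assumptions, and the rows are toy values.

```python
import pandas as pd

# Toy rows standing in for df_image_prediction
predictions = pd.DataFrame(
    {
        "p1": ["Welsh_springer_spaniel", "redbone"],
        "p2": ["collie", "Miniature_pinscher"],
        "p3": ["Shetland_sheepdog", "malinois"],
    }
)

# Apply the same casing rule to every prediction column
for col in ["p1", "p2", "p3"]:
    predictions[col] = predictions[col].str.lower()
```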

#### df_tweet_api
1. The created_at column should be a datetime type instead of an object (string)

### Tidiness issues
1. Merge the doggo, floofer, pupper, and puppo columns into one column named dog_type

2. df_twitter_archive_enhanced, df_tweet_api, df_image_prediction dataframes are to be merged into one dataframe
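
Both tidiness fixes can be sketched on toy data as shown below. Two assumptions are made here: absent stages are stored as the string `"None"`, and the three tables are combined with inner joins on `tweet_id`. The helper `combine_stages` is a hypothetical name.

```python
import pandas as pd

stage_cols = ["doggo", "floofer", "pupper", "puppo"]

# Toy stand-ins for the three dataframes
archive = pd.DataFrame(
    {
        "tweet_id": ["1", "2", "3"],
        "doggo": ["doggo", "None", "doggo"],
        "floofer": ["None", "None", "None"],
        "pupper": ["None", "None", "pupper"],
        "puppo": ["None", "None", "None"],
    }
)
api = pd.DataFrame(
    {"tweet_id": ["1", "2"], "retweet_count": [10, 20], "favorite_count": [30, 40]}
)
images = pd.DataFrame({"tweet_id": ["1", "3"], "p1": ["pug", "collie"]})


def combine_stages(row):
    """Join the stage names that are present; return None when there are none."""
    present = [row[c] for c in stage_cols if row[c] != "None"]
    return ", ".join(present) if present else None


# Tidiness issue 1: one dog_type column instead of four stage columns
archive["dog_type"] = archive[stage_cols].apply(combine_stages, axis=1)
archive = archive.drop(columns=stage_cols)

# Tidiness issue 2: inner joins keep only tweets present in all three tables
merged = archive.merge(api, on="tweet_id").merge(images, on="tweet_id")
```

Joining the names with a comma keeps tweets that mention more than one stage, instead of silently picking one of them.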

### Cleaning Data
In this section, all the issues documented during assessment were cleaned using the define, code, and test steps. Clean copies of each dataframe were made before cleaning. At the end, all the dataframes were merged into one and cleaned further, and the final dataframe was saved as a CSV file. I didn't spot every quality and tidiness issue during the initial assessment, so I iterated, revisiting the assessment to add new issues to my notes.
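
One define, code, and test cycle might look like the sketch below, using the dog-name issue as the example. The data is a toy sample, and treating the string `"None"` as a missing name is an assumption about how the archive encodes missing values.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"tweet_id": ["1", "2", "3"], "name": ["Phineas", "a", "None"]})
df_clean = df.copy()  # clean a copy so the original stays available

# Define: lowercase entries (such as 'a') and the string 'None' are not
#         real names; replace them with NaN.

# Code
invalid = df_clean["name"].str.islower() | (df_clean["name"] == "None")
df_clean.loc[invalid, "name"] = np.nan

# Test: no lowercase or 'None' entries remain among the non-missing names
remaining = df_clean["name"].dropna()
assert not remaining.str.islower().any()
assert not (remaining == "None").any()
```

Once every issue has been through this cycle, the merged dataframe can be written out with `to_csv(..., index=False)`; the output file name is whatever the notebook chooses.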

### References

1. Tweepy error code
    - https://stackoverflow.com/questions/17157753/get-the-error-code-from-tweepy-exception-instance

2. enumerate method
    - https://www.programiz.com/python-programming/methods/built-in/enumerate
    - https://www.geeksforgeeks.org/enumerate-in-python/

3. Measure elapsed time in Python
    - https://stackoverflow.com/questions/7370801/how-do-i-measure-elapsed-time-in-python

4. Show a Tweet’s JSON response
    - https://www.jcchouinard.com/tweepy-json-response-for-tweets/

5. Use a regular expression to extract name entries that are upper case
    - https://stackoverflow.com/questions/64452644/how-to-extract-the-uppercase-as-well-as-some-substring-from-pandas-dataframe-usi

6. Drop rows that contain NaN values
    - https://stackoverflow.com/questions/13413590/how-to-drop-rows-of-pandas-dataframe-whose-value-in-a-certain-column-is-nan