## Reporting: wrangle_report

Data from the @WeRateDogs Twitter account was first gathered from three places:
- A .csv file containing a historical archive of tweets was downloaded from Udacity. The archive was provided by @WeRateDogs, and Udacity enhanced it by extracting the ratings, dog names, and dog stages.
- A .tsv file containing machine learning predictions of dog breeds was downloaded programmatically from Udacity. It provided the top three predictions for each image, each with a confidence value and a flag indicating whether the prediction was a dog breed.
- Additional tweet data was retrieved through the Twitter API and stored as a .txt file in JSON format. This file was read line by line, and the `tweet_id`, retweet count, and favorite count were kept.

The data from each of these files was loaded into its own pandas DataFrame.
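As an illustration of the line-by-line read, here is a minimal sketch. The helper name `parse_tweet_lines` is hypothetical, and the field names `id_str`, `retweet_count`, and `favorite_count` are assumed from the Twitter API tweet object rather than taken from this report.

```python
import json

import pandas as pd


def parse_tweet_lines(lines):
    """Build a DataFrame from an iterable of lines, one JSON tweet per line."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines between records
        tweet = json.loads(line)
        rows.append(
            {
                # Assumed field names from the Twitter API tweet object.
                "tweet_id": tweet["id_str"],
                "retweet_count": tweet["retweet_count"],
                "favorite_count": tweet["favorite_count"],
            }
        )
    return pd.DataFrame(rows, columns=["tweet_id", "retweet_count", "favorite_count"])


# Usage (file names below are placeholders, not the project's actual names):
# with open("tweet_json.txt", encoding="utf-8") as handle:
#     tweet_json = parse_tweet_lines(handle)
# twitter_archive = pd.read_csv("archive.csv")
# img_predictions = pd.read_csv("predictions.tsv", sep="\t")
```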

For the `twitter_archive` dataset the following steps were taken:
- Visual assessment of the data was made both within the Jupyter Notebook and in Excel
- Programmatic assessment was completed within the Jupyter Notebook
- A copy of the data (`twitter_archive_clean`) was created before any cleaning was done

  
### Quality and tidiness issues that were addressed:
- Incorrect `name` values were replaced with NaN values
- `tweet_id` was converted to string format since it represented the end of the URL of a tweet
- `timestamp` was converted to datetime
- Tweets that were replies or retweets were removed from the dataset
- All `rating_denominator` values were changed to 10
- `rating_numerator` values greater than 14 were either adjusted or removed from the dataset
- Tweets with `rating_numerator` values less than 6 were reviewed manually to determine whether the image was of a dog or whether the rating had to be adjusted. Tweets whose images were not of dogs were removed from the dataset.
- `rating_numerator` values that were given as decimals in the tweet were rounded up to the nearest whole number
- Tweets that had null `expanded_urls` values were removed from the dataset
- The `dog_stage` column was created, and the individual `doggo`, `puppo`, `pupper`, and `floofer` columns were removed after their values were moved into it
- Columns that were not needed for analysis were removed (`source`, `expanded_urls`)
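
A few of the steps above can be sketched in pandas as follows. This is an illustration under stated assumptions, not the notebook's actual code: the column names `in_reply_to_status_id` and `retweeted_status_id`, the use of the string `"None"` for an absent stage, and the comma-joined form for tweets with several stages are all assumed here.

```python
import numpy as np
import pandas as pd

STAGES = ["doggo", "floofer", "pupper", "puppo"]


def clean_archive(archive):
    """Apply a subset of the cleaning steps to a copy of the archive."""
    clean = archive.copy()

    # Keep only original tweets. Assumption: replies and retweets are the
    # rows with a non-null value in these two columns.
    is_original = (
        clean["in_reply_to_status_id"].isna()
        & clean["retweeted_status_id"].isna()
    )
    clean = clean[is_original].copy()

    # tweet_id is an identifier, not a quantity, so store it as text.
    clean["tweet_id"] = clean["tweet_id"].astype(str)
    clean["timestamp"] = pd.to_datetime(clean["timestamp"])

    # Collapse the four stage columns into one. Assumption: the string
    # "None" marks an absent stage; several stages are joined with ", ".
    stages = clean[STAGES].where(clean[STAGES] != "None")
    clean["dog_stage"] = stages.apply(
        lambda row: ", ".join(row.dropna()) or np.nan, axis=1
    )
    return clean.drop(columns=STAGES)
```

Working on a copy, as described earlier, leaves the original DataFrame available for comparison after each step.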

For the `img_predictions` dataset the following steps were taken:
- Visual assessment of the data was made in Jupyter Notebook
- Programmatic assessment was completed
- A copy of the data (`img_predictions_clean`) was created before any cleaning was done

### Quality and tidiness issues that were addressed:
- `tweet_id` was converted to string format since it represented the end of the URL of a tweet and would enable this dataset to be merged with the `twitter_archive` dataset later
- Rows with duplicated values in the `jpg_url` column were removed. The last instance of each duplicate was dropped, since it was the retweet of an original tweet.
- The `dogger_prediction_type` column was created to keep one prediction per image: a prediction that was a dog breed where one of the three was, or otherwise the first (most confident) prediction, whether or not it was a breed.
- The `dogger_prediction_confidence` column was created to keep the confidence value for the prediction in the `dogger_prediction_type` column.
- All other columns relating to predictions were removed from the dataset
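
One possible reading of the selection rule above is sketched here. The input column names (`p1`, `p1_conf`, `p1_dog`, and so on for the second and third predictions) are assumptions about the layout of the .tsv file, and the helper names are hypothetical. The rule coded here (first breed prediction in rank order, else the top prediction) is an interpretation and may differ from the notebook.

```python
import pandas as pd


def pick_prediction(row):
    """Return (label, confidence) for the first prediction flagged as a breed.

    Falls back to the top-ranked prediction when none of the three is a breed.
    """
    for rank in (1, 2, 3):
        if row[f"p{rank}_dog"]:
            return row[f"p{rank}"], row[f"p{rank}_conf"]
    return row["p1"], row["p1_conf"]


def clean_predictions(predictions):
    clean = predictions.copy()
    clean["tweet_id"] = clean["tweet_id"].astype(str)

    # keep="first" drops the later row for each duplicated image URL.
    clean = clean.drop_duplicates(subset="jpg_url", keep="first").copy()

    # Each returned tuple is expanded into two columns, labelled 0 and 1.
    picked = clean.apply(pick_prediction, axis=1, result_type="expand")
    clean["dogger_prediction_type"] = picked[0]
    clean["dogger_prediction_confidence"] = picked[1]

    keep = [
        "tweet_id",
        "jpg_url",
        "dogger_prediction_type",
        "dogger_prediction_confidence",
    ]
    return clean[keep]
```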

For the `tweet_json` dataset the following steps were taken:
- Visual assessment of the data was made in Jupyter Notebook
- Programmatic assessment was completed
- A copy of the data (`tweet_json_clean`) was created before any cleaning was done

### Quality and tidiness issues that were addressed:
- No issues were addressed with this dataset since it was already complete and in the correct formats

## Merging

The `twitter_archive_clean` dataset was merged with the `img_predictions_clean` dataset. Once this was done, any records that had null values for `jpg_url` were removed.

Once that merge was complete, the new dataset was merged with the `tweet_json_clean` dataset.
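
The two merges might look like the sketch below. The join type is not stated in this report: a left join is assumed for the first merge because null `jpg_url` values could only appear afterwards if unmatched archive rows were kept, and a left join is assumed for the second merge as well. The function name `build_master` is hypothetical.

```python
import pandas as pd


def build_master(archive_clean, predictions_clean, tweet_json_clean):
    """Merge the three cleaned DataFrames on tweet_id."""
    # Assumed left join: archive tweets without a prediction row get NaN jpg_url.
    merged = archive_clean.merge(predictions_clean, on="tweet_id", how="left")

    # Drop tweets that have no image.
    merged = merged.dropna(subset=["jpg_url"])

    # Assumed left join again, so tweets missing from the JSON data are kept.
    return merged.merge(tweet_json_clean, on="tweet_id", how="left")
```

All three DataFrames need `tweet_id` in the same type (string here) for the keys to match.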

## Saving

This master dataset was then saved as the `twitter_archive_master.csv` file.
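
A minimal sketch of the save step, with hypothetical helper names. The `load_master` helper is an addition not described in this report: it shows one way to keep `tweet_id` as a string when the file is read back, since `read_csv` would otherwise infer an integer type.

```python
import pandas as pd


def save_master(master, path_or_buffer="twitter_archive_master.csv"):
    """Write the master dataset without the DataFrame index column."""
    master.to_csv(path_or_buffer, index=False)


def load_master(path_or_buffer="twitter_archive_master.csv"):
    """Read the master dataset back, keeping tweet_id as text."""
    return pd.read_csv(path_or_buffer, dtype={"tweet_id": str})
```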