## Report: act_report

# Project: Wrangling of We Rate Dogs Twitter Data

## Table of Contents
<ul>
<li><a href="#intro">Introduction to Wrangling</a></li>
<li><a href="#Gathering">Data Gathering</a></li>
<li><a href="#Accessement">Data Assessment</a></li>
<li><a href="#Cleaning">Data Cleaning</a></li>
<li><a href="#Analyzing & Visualizing">Analyzing & Visualizing Data</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction to Wrangling


WeRateDogs is a Twitter account that posts photos of dogs together with a rating for each dog. The data used in this project is a download of the tweets posted by the WeRateDogs account. It contains information about each dog, e.g. a photo, name, breed or 'age group', and rating. Secondary data obtained alongside it are the retweet and favorite counts. The goal of this report is to gather all the data pertaining to the WeRateDogs account (there are three different datasets), assess the data and note all the issues present, clean the noted issues, and perform various analyses.

The following are some of the research questions that are to be answered:
* Which device is used most often to tweet content on the WeRateDogs account?
* What is the relationship between favorite counts and retweet counts?
* What is the relationship between the numerator rating and the favorite counts?

<a id='Gathering'></a>
## Data Gathering

Three datasets were provided for this project:
* JSON file containing raw tweet data extracted using tweepy
* Tweet image predictions file containing predictions for all the dog images from a machine learning model
* WeRateDogs Twitter archive file of partially extracted tweet data

First, the following packages were imported to aid with the data wrangling and visualisation:
><ol>
  <p><img src="packages.png" style="height: 300px;"/></p>
</ol>
1. JSON File
><ol>
  <p><img src="json.png" style="height: 400px;"/></p>
</ol>
2. Tweet Image Predictions
><ol>
  <p><img src="predictions.png" style="height: 300px;"/></p>
</ol>
3. WeRateDogs Twitter Archive File
><ol>
  <p><img src="archive.png" style="height: 300px;"/></p>
</ol>
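
The screenshots above show the actual loading code. As a rough illustration of the JSON step only, the sketch below reads tweet records into a dataframe. It assumes the file stores one JSON object per line and that each object carries the `id`, `retweet_count` and `favorite_count` fields. The helper name `read_tweet_json` and the two sample records are made up for this example.

```python
import io
import json

import pandas as pd

# Two made-up records standing in for the real file.
# Assumption: the file holds one JSON object per line.
sample_file = io.StringIO(
    '{"id": 1, "retweet_count": 10, "favorite_count": 25}\n'
    '{"id": 2, "retweet_count": 3, "favorite_count": 8}\n'
)


def read_tweet_json(handle):
    """Collect the id and engagement counts from a line-delimited JSON stream."""
    rows = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        rows.append({
            "tweet_id": tweet["id"],
            "retweet_count": tweet["retweet_count"],
            "favorite_count": tweet["favorite_count"],
        })
    return pd.DataFrame(rows)


tweets = read_tweet_json(sample_file)
print(tweets)
```

With the real file, the `io.StringIO` object would be replaced by an open file handle.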

<a id='Accessement'></a>
## Data Assessment

After assessing the data both visually and programmatically, a number of data quality and tidiness issues were noticed and recorded as follows:
### Quality issues
##### `twitter_archive` and `json` files
* Some numerator values were wrongly extracted from the text field, e.g. the tweet with id #680494726643068929 was extracted as 26/10 when it should be 11.26/10

* Timestamp is not in the correct format; it is to be converted to a datetime format

* Source column is not clean; it contains leftover HTML special characters

* Tweet id extracted from json file has data type object and not integer

* in_reply_to_user_id/in_reply_to_status_id columns should have int as their data types and not float

* Some denominator and numerator values need to be transformed to the same scale as other tweets, e.g. tweet #677716515794329600 has 144/120, which should be 12/10

* Rating numerator column has outliers, with values as large as 1776

* Rating denominator column has outliers; it should be 10 across all tweets but has values both less than and more than 10
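
The two rating issues in this list (a decimal numerator read as 26 instead of 11.26, and 144/120 needing to become 12/10) can be sketched as follows. This is a minimal illustration, not the code used in the project. The regular expression, the helper names `extract_rating` and `rescale_to_ten`, and the sample tweet text are all assumptions. Only the rating values come from the report.

```python
import re

# Matches "numerator/denominator", allowing a decimal numerator such as 11.26.
RATING_PATTERN = re.compile(r"(\d+(?:\.\d+)?)/(\d+)")


def extract_rating(text):
    """Return (numerator, denominator) from the first rating found, or None."""
    match = RATING_PATTERN.search(text)
    if match is None:
        return None
    return float(match.group(1)), int(match.group(2))


def rescale_to_ten(numerator, denominator):
    """Express a rating on the usual out-of-10 scale."""
    return numerator * 10 / denominator


# Made-up tweet text; only the rating values are taken from the report.
print(extract_rating("Hypothetical tweet text. 11.26/10 would pet"))
print(rescale_to_ten(144, 120))
```

Because the pattern takes the first `number/number` pair it finds, a tweet that contains another fraction before the rating would still need a manual check.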

### Tidiness issues
##### `twitter_archive` file
* Source column can be split into two different columns with device and url column being created

* Retweeted tweets are part of the main dataframe; these need dropping
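
Both tidiness issues can be illustrated on a toy dataframe. The sketch below assumes that each `source` value is an HTML anchor tag whose `href` holds the url and whose link text names the device. It also assumes that retweets can be recognised by a non-null `retweeted_status_id` column. Neither detail is stated in this report, and the three rows are invented for the example.

```python
import pandas as pd

# Toy rows. The anchor-tag layout and the retweeted_status_id column
# are assumptions made for this illustration.
archive = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "source": [
        '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
        '<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>',
        '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>',
    ],
    "retweeted_status_id": [None, 123.0, None],
})

# Pull the link target and the visible label out of the anchor tag.
parts = archive["source"].str.extract(
    r'href="(?P<url>[^"]+)"[^>]*>(?P<device>[^<]+)<'
)
archive = pd.concat([archive.drop(columns="source"), parts], axis=1)

# Keep original tweets only: a retweet carries a non-null retweeted_status_id.
originals = archive[archive["retweeted_status_id"].isna()].reset_index(drop=True)
print(originals[["tweet_id", "device", "url"]])
```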

<a id='Cleaning'></a>
## Data Cleaning

All the captured issues were cleaned by first defining the issue, then coding the fix, and lastly testing to check that the issue had been fixed. Below are three examples of the issues noted above having been fixed:
* Issue One: The timestamp column was not clean enough to be used for analysis; it was stored as a string instead of a datetime
><ol>
  <p><img src="./issue1.png" style="height: 400px;"/></p>
</ol>
* Issue Three: The source column was not tidy, so it was split into device and url columns
><ol>
  <p><img src="./issue3.png" style="height: 400px;"/></p>
</ol>
* Issue Seven and Eight: Some of the values in the rating_numerator column were wrongly extracted, while others needed transformation to fit the scale
><ol>
  <p><img src="./issue78.png" style="height: 400px;"/></p>
</ol>
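
The define, code, test cycle can be sketched for Issue One as below. This is an illustration under assumptions rather than the code in the screenshot: the two timestamps are made up, and the `YYYY-MM-DD HH:MM:SS +0000` layout is assumed.

```python
import pandas as pd

# Define: the timestamp column is stored as strings instead of datetimes.
# Made-up values in an assumed "YYYY-MM-DD HH:MM:SS +0000" layout.
archive = pd.DataFrame({
    "timestamp": ["2016-01-02 03:04:05 +0000", "2017-06-07 08:09:10 +0000"],
})

# Code: parse the strings into timezone-aware datetimes.
archive["timestamp"] = pd.to_datetime(archive["timestamp"], utc=True)

# Test: the column should now have a datetime dtype.
assert pd.api.types.is_datetime64_any_dtype(archive["timestamp"])
print(archive["timestamp"].dt.year.tolist())
```

Once the column is a datetime, the `.dt` accessor gives direct access to components such as the year or month for later analysis.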

<a id='Analyzing & Visualizing'></a>
## Analyzing & Visualizing Data

### Research Questions
* Which device is used most often to tweet content on the WeRateDogs account?
><ol>
<p><img src="./dist.png" style="height: 120px;"/> </p>  
</ol>
><ol>
<p><img src="./pie.png" style="height: 500px;"/> </p>  
</ol>
* What is the relationship between favorite counts and retweet counts?
><ol>
<p><img src="./sca1.png" style="height: 700px;"/> </p>  
</ol>
* What is the relationship between the numerator rating and the favorite counts?
><ol>
<p><img src="./sca2.png" style="height: 700px;"/> </p>  
</ol>
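
The figures above come from the cleaned dataset. As a hedged sketch of how such numbers might be computed, the example below derives device shares and a Pearson correlation from a small invented dataframe. The column names are assumptions, and the values are not the report's data, so the printed results will not match the plots.

```python
import pandas as pd

# Made-up values for illustration; they are not the report's data.
tweets = pd.DataFrame({
    "device": [
        "Twitter for iPhone",
        "Twitter for iPhone",
        "Twitter for iPhone",
        "TweetDeck",
    ],
    "retweet_count": [10, 20, 30, 40],
    "favorite_count": [25, 45, 65, 85],
})

# Share of tweets per device, as percentages.
device_share = tweets["device"].value_counts(normalize=True) * 100
print(device_share.round(2))

# Pearson correlation between retweets and favorites.
# A value near +1 indicates a strong positive linear relationship.
correlation = tweets["retweet_count"].corr(tweets["favorite_count"])
print(round(correlation, 2))
```

The same `.corr()` call applies to the numerator rating and the favorite counts for the third research question.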

<a id='conclusions'></a>
## Conclusions
><ol>
  <p>The majority of the tweets from the WeRateDogs account were posted using an iPhone: 98.04% of the tweets have Twitter for iPhone as the device used, with the remaining less than two percent posted from TweetDeck (1.40%) and Twitter Web Client (0.56%).</p>
  <p>There is a positive correlation between retweet count and favorite count: tweets with more retweets also tend to have more favorites.</p>
  <p>The best rated dogs have ratings between 13 and 14. These dogs also received more favorites than the lowest rated dogs, at 7, which likewise received low retweet and favorite counts.</p>
</ol>
