# **Wrangling Report**

This report summarizes the data wrangling done to prepare WeRateDogs
Twitter data for analysis. The work involved gathering data from
multiple sources, assessing its quality and tidiness, and cleaning it
for analysis.

### **Gathering**  
Three datasets were retrieved:

**Twitter archive**: This was provided as a CSV file.

**Image Predictions**: Downloaded from a URL using the requests library.

**Extra tweet data (favorite count, retweet count)**: Gathered using the Twitter API.
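
As a rough illustration of the third step, the sketch below turns stored API responses into a table. It assumes, without confirmation from this report, that each tweet was saved as one JSON object per line and that the fields are named `id`, `favorite_count`, and `retweet_count`, as in Twitter's v1.1 tweet object. `parse_tweet_lines` is a hypothetical helper name and the sample input is invented.

```python
import json

import pandas as pd


def parse_tweet_lines(lines):
    """Build a DataFrame of engagement counts from JSON-encoded tweets.

    Assumes one JSON object per line with the keys 'id',
    'favorite_count' and 'retweet_count' (an assumption, not
    something the report specifies).
    """
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        rows.append({
            "tweet_id": tweet["id"],
            "favorite_count": tweet["favorite_count"],
            "retweet_count": tweet["retweet_count"],
        })
    return pd.DataFrame(
        rows, columns=["tweet_id", "favorite_count", "retweet_count"]
    )


# Invented sample input, for illustration only
sample = [
    '{"id": 1, "favorite_count": 10, "retweet_count": 2}',
    '',
    '{"id": 2, "favorite_count": 5, "retweet_count": 1}',
]
tweet_data = parse_tweet_lines(sample)
```

In practice the same function could be fed an open file handle, since iterating over a text file also yields one line at a time.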


### **Assessing**  
Each dataset was assessed both visually and programmatically. The key
problems identified were:  
**Twitter Archive**  

The "name" column contains incorrect values.

The dog stage information is spread out over several columns.

The "source" column is incorrectly formatted. 

The dataset contains retweets and replies in addition to the original
ratings.

The rating numerator and rating denominator columns contain outliers.

Some columns ('timestamp' and 'tweet_id') have the wrong data type.

**Image Predictions**

The prediction columns have inconsistent capitalization.

The prediction values use inconsistent delimiters.

The tweet_id column has the wrong data type.

**Tweet Data**

The tweet_id column has the wrong data type.
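
Programmatic assessment of this kind usually relies on a few summary calls. The sketch below shows one possible version and is not necessarily what was run for this report. `quick_assess` is a hypothetical helper and the sample rows are invented.

```python
import pandas as pd


def quick_assess(df):
    """Return a few summary facts useful for spotting quality issues."""
    return {
        # column name -> dtype name, to spot IDs stored as numbers
        "dtypes": df.dtypes.astype(str).to_dict(),
        # column name -> number of missing values
        "missing": df.isna().sum().to_dict(),
        # number of rows that repeat an earlier row exactly
        "duplicate_rows": int(df.duplicated().sum()),
    }


# Invented rows, for illustration only
sample = pd.DataFrame({
    "tweet_id": [1, 2, 2, 3],
    "name": ["Charlie", "Bella", "Bella", None],
})
report = quick_assess(sample)
```

Here the integer dtype reported for `tweet_id` is the kind of finding that leads to the "wrong data type" entries above, since an identifier is not something to do arithmetic on.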


### **Cleaning**  
The following cleaning steps were performed:  
**Twitter Archive**

* Filtered out Retweets and Replies - Quality issue

* Validated names: replaced 'None' with 'No Name' - Quality issue

* Merged columns of dog stage into one column called 'stage' - Tidiness issue

* Reduced categories in column 'source' - Quality issue

* Used IQR method to remove outliers from rating columns - Quality issue

* Converted column 'timestamp' to datetime and 'tweet_id' to string - Quality issue

* Added a column that represents rating ratio - Tidiness issue
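
Several of the steps above can be sketched with pandas as follows. This is a minimal illustration under stated assumptions, not the code used for this report. The column names `retweeted_status_id`, `in_reply_to_status_id`, `doggo`, `floofer`, `pupper`, `puppo`, and `rating_numerator` are assumed, as are the 'None' placeholder in the stage columns and the conventional IQR multiplier of 1.5. All function names are hypothetical.

```python
import pandas as pd

# Column names other than 'timestamp', 'tweet_id' and 'stage' are
# assumptions; the report does not list them.
STAGE_COLS = ["doggo", "floofer", "pupper", "puppo"]


def keep_original_tweets(df):
    """Drop retweets and replies, assumed to be rows with a non-null reference ID."""
    is_original = (
        df["retweeted_status_id"].isna() & df["in_reply_to_status_id"].isna()
    )
    return df[is_original].copy()


def merge_stage_columns(df):
    """Collapse the per-stage columns into one 'stage' column.

    Assumes each stage column holds its own name when the stage applies
    and the string 'None' otherwise.
    """
    df = df.copy()
    df["stage"] = [
        ", ".join(col for col in STAGE_COLS if row[col] == col) or "None"
        for _, row in df[STAGE_COLS].iterrows()
    ]
    return df.drop(columns=STAGE_COLS)


def remove_iqr_outliers(df, column, k=1.5):
    """Keep rows whose value lies within [Q1 - k*IQR, Q3 + k*IQR].

    k=1.5 is the conventional multiplier, assumed here.
    """
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    return df[df[column].between(q1 - k * iqr, q3 + k * iqr)].copy()


def fix_types(df):
    """Convert 'timestamp' to datetime and 'tweet_id' to string."""
    df = df.copy()
    df["timestamp"] = pd.to_datetime(df["timestamp"])
    df["tweet_id"] = df["tweet_id"].astype(str)
    return df


# Invented rows, for illustration only
archive = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "timestamp": ["2017-01-01 10:00:00"] * 4,
    "retweeted_status_id": [None, 99.0, None, None],
    "in_reply_to_status_id": [None, None, 98.0, None],
    "doggo": ["doggo", "None", "None", "None"],
    "floofer": ["None"] * 4,
    "pupper": ["pupper", "None", "None", "None"],
    "puppo": ["None"] * 4,
    "rating_numerator": [12, 13, 11, 10],
})
cleaned = fix_types(merge_stage_columns(keep_original_tweets(archive)))
```

Each function returns a copy, so the steps can be chained or reordered without modifying the raw archive. A tweet that mentions two stages keeps both, joined by a comma, which is one of several reasonable ways to handle that case.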

**Image Predictions**

* Filtered to contain the same tweets as the cleaned archive - Quality issue

* Standardized capitalization of prediction columns - Quality issue

* Standardized delimiters for prediction values - Quality issue

* Converted column 'tweet_id' to string - Quality issue
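
The two standardization steps might look like the sketch below. It assumes the prediction columns are named `p1`, `p2`, and `p3`, that the stray delimiters are underscores and hyphens, and that lower case with spaces is the target format. None of these details is stated in this report, and `standardize_predictions` is a hypothetical name.

```python
import pandas as pd

PREDICTION_COLS = ["p1", "p2", "p3"]  # assumed column names


def standardize_predictions(df, columns=PREDICTION_COLS):
    """Lower-case prediction labels and use a space as the only delimiter."""
    df = df.copy()
    for col in columns:
        df[col] = (
            df[col]
            .str.replace(r"[_-]+", " ", regex=True)  # underscores/hyphens -> space
            .str.lower()
        )
    return df


# Invented rows, for illustration only
predictions = pd.DataFrame({
    "p1": ["golden_retriever", "Labrador_retriever"],
    "p2": ["black-and-tan_coonhound", "Chesapeake_Bay_retriever"],
    "p3": ["web_site", "German_shepherd"],
})
standardized = standardize_predictions(predictions)
```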

**Tweet Data**

* Filtered to retain only tweets that have a match in the cleaned archive - Quality issue

* Converted 'tweet_id' to string - Quality issue
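
Filtering a secondary table down to the tweets that survive in the cleaned archive can be sketched as below. The same approach would apply to the image predictions. `match_to_archive` is a hypothetical helper, `favorite_count` is an assumed column name, and the sample rows are invented.

```python
import pandas as pd


def match_to_archive(df, archive, key="tweet_id"):
    """Keep rows of df whose key appears in archive, comparing keys as strings."""
    df = df.copy()
    df[key] = df[key].astype(str)
    archive_ids = set(archive[key].astype(str))
    return df[df[key].isin(archive_ids)].copy()


# Invented rows, for illustration only
cleaned_archive = pd.DataFrame({"tweet_id": ["1", "4"]})
tweet_data = pd.DataFrame({
    "tweet_id": [1, 2, 4],
    "favorite_count": [10, 7, 5],
})
matched = match_to_archive(tweet_data, cleaned_archive)
```

Converting both sides to strings before comparing avoids silent mismatches between an integer ID in one table and a string ID in the other.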

### **Data Storage**

The three cleaned datasets were combined into one master dataset, which
was stored as 'twitter_archive_master.csv'.
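
One way to combine and store the tables is sketched below. It assumes an inner join on 'tweet_id', which this report does not specify. `build_master` is a hypothetical helper, the sample rows are invented, and the demo writes to a temporary directory rather than the working directory.

```python
import os
import tempfile

import pandas as pd


def build_master(archive, predictions, tweet_data, key="tweet_id"):
    """Join the three cleaned tables on the shared key (inner join assumed)."""
    master = archive.merge(predictions, on=key, how="inner")
    return master.merge(tweet_data, on=key, how="inner")


# Invented rows, for illustration only
archive = pd.DataFrame({"tweet_id": ["1", "4"], "stage": ["doggo", "pupper"]})
predictions = pd.DataFrame({"tweet_id": ["1", "4"], "p1": ["pug", "beagle"]})
tweet_data = pd.DataFrame({"tweet_id": ["1", "4"], "favorite_count": [10, 5]})

master = build_master(archive, predictions, tweet_data)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "twitter_archive_master.csv")
    master.to_csv(path, index=False)  # index=False avoids an extra unnamed column
    # Pass dtype explicitly: CSV does not record that tweet_id is a string
    reloaded = pd.read_csv(path, dtype={"tweet_id": str})
```

Because a CSV file carries no type information, reading the master file back without `dtype={"tweet_id": str}` would turn the IDs into integers again.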

### **Conclusion**
The data wrangling process addressed the quality and tidiness issues in
the datasets, which are now ready for analysis.
