## WeRateDogs Data Wrangling Report
### Aaron Johnson



### Gathering

The main `twitter_archive` file was provided, and I loaded it into a pandas data frame with `pd.read_csv()`. The `images` data frame required the `requests` library to download the file from a given URL. The last data frame was considerably more involved because it required retrieving Twitter API data with the `tweepy` package.

Some tweets could not be retrieved; for those the API returned the error message “No status found with that ID.”
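
One way to keep such failures from stopping the download is to catch the error per tweet and record the ID. The following is only a sketch: `fetch_all`, `fake_fetch`, and the use of `LookupError` are hypothetical stand-ins, and the real `tweepy` call and exception class (which vary by `tweepy` version) would be passed in instead.

```python
def fetch_all(tweet_ids, fetch_one, not_found_errors=(Exception,)):
    """Fetch each tweet; collect the IDs that fail instead of aborting the loop."""
    fetched, failed = [], {}
    for tweet_id in tweet_ids:
        try:
            fetched.append(fetch_one(tweet_id))
        except not_found_errors as err:
            # Remember why this ID failed so it can be reported later.
            failed[tweet_id] = str(err)
    return fetched, failed


def fake_fetch(tweet_id):
    """Hypothetical stand-in for the API call: tweet 2 has been deleted."""
    if tweet_id == 2:
        raise LookupError("No status found with that ID.")
    return {"id": tweet_id, "retweet_count": 10 * tweet_id, "favorite_count": 100 * tweet_id}


tweets, missing = fetch_all([1, 2, 3], fake_fetch, not_found_errors=(LookupError,))
```

Passing the fetch function and the exception types as arguments keeps the loop independent of any particular API client.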

### Assessing

For the initial stages of the assessment I viewed the beginning and end of each data frame, as well as its info. This investigation showed that quite a few things in the data needed to be cleaned.

Issues with quality:
- *timestamp* and *retweeted_status_timestamp* were stored as strings instead of time data.
- 181 of the tweets were actually retweets.
- The values for *source* were stored as a full HTML `<a>` tag.
- Some of the tweets did not have images.
- Tweets had multiple slashes in the *text* column leading to false ratings.
- There were missing dog names.
- Some dogs received the wrong name information.
- *in_reply_to_status_id*, *in_reply_to_user_id*, *retweeted_status_id*, *retweeted_status_user_id* all had a float datatype.
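
Several of the issues listed above can also be checked programmatically. The sketch below uses a small made-up data frame that only mimics the archive's columns; the values are illustrative and not taken from the real data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the archive (values are made up for illustration).
archive = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "timestamp": [
        "2017-08-01 16:23:56 +0000",
        "2017-07-31 00:18:03 +0000",
        "2017-07-30 15:58:51 +0000",
    ],
    "retweeted_status_id": [np.nan, 8.8e17, np.nan],
    "name": ["Phineas", "a", "None"],
})

# Rows with a retweeted_status_id are retweets.
n_retweets = archive["retweeted_status_id"].notna().sum()

# Timestamps stored as strings, IDs stored as floats.
timestamp_is_string = pd.api.types.is_string_dtype(archive["timestamp"])
id_is_float = pd.api.types.is_float_dtype(archive["retweeted_status_id"])

# Names that are entirely lowercase are likely extraction mistakes.
lowercase_names = archive.loc[archive["name"].str.islower(), "name"].tolist()
```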

### Cleaning

After making a clean copy of each data frame so that I could still refer to the original raw data, I began *defining*, *coding*, and *testing* a fix for each of the issues mentioned in the assessment.

I dropped the tweets that did not have images by writing over the cleaned data with only those tweets whose `expanded_urls` values were not null.
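
A minimal sketch of that filter, using a toy data frame with made-up URLs in place of the real archive:

```python
import pandas as pd

# Toy data: tweet 2 has no expanded_urls value, so it has no image.
archive_clean = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "expanded_urls": [
        "https://twitter.com/dog_rates/status/1/photo/1",
        None,
        "https://twitter.com/dog_rates/status/3/photo/1",
    ],
})

# Keep only the rows whose expanded_urls value is not null.
archive_clean = archive_clean[archive_clean["expanded_urls"].notna()].reset_index(drop=True)
```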

The dogs who had been assigned the wrong name due to the names being extracted incorrectly from the `text` column all had a lowercase word as their name. To fix this I wrote a function to check if the name began with a lowercase letter and to replace the name with a value of ‘None’.

For the dogs with a name of ‘None’, I used numpy to replace the ‘None’ string with a null value.
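
The two name-cleaning steps could look roughly like this. The function name `drop_lowercase_name` and the sample names are hypothetical; this is a sketch of the idea rather than the exact function used.

```python
import numpy as np
import pandas as pd


def drop_lowercase_name(name):
    """Return 'None' when the extracted name starts with a lowercase letter."""
    if isinstance(name, str) and name[:1].islower():
        return "None"
    return name


names = pd.Series(["Phineas", "a", "quite", "None"])

# Step 1: lowercase "names" such as "a" or "quite" become the string 'None'.
names = names.apply(drop_lowercase_name)

# Step 2: the 'None' string becomes a true null value.
names = names.replace("None", np.nan)
```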

I wrote over the dataset with only those tweets whose `retweeted_status_id` was null. I then dropped the `retweeted_status` columns.
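
As a sketch, with a toy data frame standing in for the archive (values are made up):

```python
import numpy as np
import pandas as pd

# Toy data: tweet 2 is a retweet because it has a retweeted_status_id.
archive_clean = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "retweeted_status_id": [np.nan, 8.8e17, np.nan],
    "retweeted_status_user_id": [np.nan, 4.2e9, np.nan],
    "retweeted_status_timestamp": [None, "2017-07-15 02:44:07 +0000", None],
})

# Keep only original tweets.
archive_clean = archive_clean[archive_clean["retweeted_status_id"].isna()]

# The retweet columns are now entirely null, so drop them.
retweet_cols = [c for c in archive_clean.columns if c.startswith("retweeted_status")]
archive_clean = archive_clean.drop(columns=retweet_cols)
```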

The tidiness issue of the dog stage variables being stored in both rows and columns proved to be the most difficult by far. I spent hours trying to use pandas `melt` to combine everything into one column. After giving up on `melt`, I found a workaround via Stack Overflow and another student’s solution (see references section).
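
For illustration, one possible way to collapse the stage columns without `melt` is a row-wise join. This is not necessarily the workaround referred to above; it assumes four stage columns named `doggo`, `floofer`, `pupper`, and `puppo` that hold either the stage name or the string 'None', and the new column name `stage` is my own choice.

```python
import numpy as np
import pandas as pd

# Assumed layout: one column per stage, holding the stage name or 'None'.
stage_cols = ["doggo", "floofer", "pupper", "puppo"]
archive_clean = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "doggo": ["None", "doggo", "None"],
    "floofer": ["None", "None", "None"],
    "pupper": ["pupper", "pupper", "None"],
    "puppo": ["None", "None", "None"],
})


def combine_stages(row):
    """Join every stage present in the row; return NaN when there is none."""
    found = [row[col] for col in stage_cols if row[col] != "None"]
    return ", ".join(found) if found else np.nan


archive_clean["stage"] = archive_clean[stage_cols].apply(combine_stages, axis=1)
archive_clean = archive_clean.drop(columns=stage_cols)
```

Joining with a comma keeps tweets that mention more than one stage instead of silently picking one.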

I then joined the data frame with the `favorite_count` and `retweet_count` information to the main archive data frame.
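
A sketch of such a join, assuming both data frames share a `tweet_id` key and that an inner join is wanted (so tweets that could not be retrieved from the API drop out). The frame name `api_counts` and all values are made up.

```python
import pandas as pd

archive_clean = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "name": ["Phineas", "Tilly", "Archie"],
})

# Tweet 2 is missing here, as if the API lookup had failed for it.
api_counts = pd.DataFrame({
    "tweet_id": [1, 3],
    "retweet_count": [10, 30],
    "favorite_count": [100, 300],
})

merged = archive_clean.merge(api_counts, on="tweet_id", how="inner")
```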

As for the quality issues, I converted the `timestamp` and `retweeted_status_timestamp` columns to datetime data types. I then wrote a function that used regular expressions to clean up the information provided in the `source` column.
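
Both conversions can be sketched on a toy data frame. The sample rows and the exact regular expression are assumptions; the pattern simply captures the text between the opening and closing `<a>` tags.

```python
import pandas as pd

archive_clean = pd.DataFrame({
    "timestamp": ["2017-08-01 16:23:56 +0000", "2017-07-31 00:18:03 +0000"],
    "source": [
        '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
        '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>',
    ],
})

# Parse the timestamp strings into timezone-aware datetimes.
archive_clean["timestamp"] = pd.to_datetime(archive_clean["timestamp"])

# Keep only the link text of the <a> tag.
archive_clean["source"] = archive_clean["source"].str.extract(r">([^<]+)</a>", expand=False)
```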

I then addressed the issue of the false ratings by writing a function that would save the correct rating in a new column. I then converted the fraction rating to a float rating in order to analyze it later.
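
To illustrate the idea, the sketch below picks a rating out of text that contains more than one fraction. The selection rule (prefer the last fraction with a denominator of 10, otherwise fall back to the first fraction) is an assumption for this example, not necessarily the rule used in the project, and `extract_rating` is a hypothetical name.

```python
import re

# A numerator (optionally with decimals), a slash, and an integer denominator.
RATING_PATTERN = re.compile(r"(\d+(?:\.\d+)?)/(\d+)")


def extract_rating(text):
    """Return the rating as a float (numerator / denominator), or None."""
    matches = RATING_PATTERN.findall(text)
    if not matches:
        return None
    # Assumed rule: prefer the last fraction out of 10, else the first fraction.
    out_of_ten = [m for m in matches if m[1] == "10"]
    numerator, denominator = out_of_ten[-1] if out_of_ten else matches[0]
    if float(denominator) == 0:
        return None
    return float(numerator) / float(denominator)


two_fractions = extract_rating("She ate 1/2 of my sandwich. 12/10")
decimal_rating = extract_rating("11.5/10 would pet")
no_rating = extract_rating("no rating here")
```

In the first sample text, `1/2` is skipped in favor of `12/10`, which is the kind of false rating described above.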

### Analyzing

For the analysis I was interested to see what the average rating was, what the most frequent dog stage was, and what kind of relationship existed between `retweet_count` and `favorite_count`.

I found that the average rating was around 1.165, that the most frequent dog stage was *pupper*, and that there appeared to be a positive linear relationship between `retweet_count` and `favorite_count`.
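
These three questions map onto short pandas calls. The sketch below runs them on made-up numbers, so its results are not the figures reported above; the column names `rating` and `stage` are assumptions.

```python
import pandas as pd

# Toy data only; the real analysis used the cleaned archive.
df = pd.DataFrame({
    "rating": [1.2, 1.1, 1.3, 1.0],
    "stage": ["pupper", "doggo", "pupper", None],
    "retweet_count": [10, 20, 30, 40],
    "favorite_count": [100, 210, 290, 405],
})

mean_rating = df["rating"].mean()

# value_counts ignores nulls, so idxmax gives the most frequent known stage.
top_stage = df["stage"].value_counts().idxmax()

# Pearson correlation; a value near 1 suggests a positive linear relationship.
correlation = df["retweet_count"].corr(df["favorite_count"])
```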