### Wrangled data insights and visualization report of the WeRateDogs Twitter page

### Introduction:
This report presents the insights and visualizations that resulted from wrangling tweet data from the Twitter handle <a href="https://en.wikipedia.org/wiki/WeRateDogs">WeRateDogs</a>, where people post information about their dogs for others to rate and comment on. The data was gathered from several sources: some was provided to Udacity and some was extracted programmatically from the Twitter API using Tweepy. Here is a snapshot of the WeRateDogs Twitter page:

![dog-rates-social.jpg](attachment:dog-rates-social.jpg)

The data consisted of basic tweet data for 5000+ tweets affiliated with WeRateDogs, such as the tweet ID, the source used for the tweets, the tweet text and the dog names; additional data about the tweets (such as retweet and favorite counts); and data containing image predictions. Considerable effort went into assessing and cleaning the data and merging it into a final dataset, from which the insights and visualizations were produced.

### Insights and visualizations:
 
Some of the insights captured from the data are as follows:
* Dogs with the highest favorites
* The top 9 popular dog names
* The most popular source for tweets
* The dogs with the highest retweets
* The highest rated dog - calculated using the rating numerator and denominator
* Relationship between retweets and favorites

#### 1. Dog breeds with the ten highest favorite counts
The most-favorited tweets were identified by taking the maximum value from the favorites column of the dataset and extracting the dog name. The dog breed found to have the highest favorite count was the Labrador retriever.
 
![Dog_breeds_favorites.png](attachment:Dog_breeds_favorites.png)
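
The ranking described above can be sketched roughly as follows. This is a minimal illustration on toy data, not the notebook's actual code: the column names `breed` and `favorite_count` and all values are assumptions made for the example.

```python
import pandas as pd

# Toy stand-in for the merged dataset; column names and values are hypothetical.
df = pd.DataFrame({
    "breed": ["Labrador retriever", "pug", "Labrador retriever", "chow", "pug"],
    "favorite_count": [900, 150, 400, 300, 200],
})

# Highest favorite count seen for each breed, keeping at most the ten largest.
top_breeds = (
    df.groupby("breed")["favorite_count"]
      .max()
      .nlargest(10)
)
print(top_breeds)
```

Taking the maximum per breed mirrors the "maximum value from the favorites column" wording; summing or averaging per breed would answer a slightly different question.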

#### 2. Top nine popular dog names
The nine most popular dog names were found by counting each occurrence of a dog name with the value counts function, then taking the top 11 counts. Eleven was chosen because two of the top entries are not real names. A number of rows with missing dog names were given the value 'None'. There is no way of telling whether 'None' covers dog names already present in the dataset, so I consider it an outlier and exclude it from the analysis. There are also values of 'a' in the name column, which are obviously not correct: dog names were extracted programmatically from the 'text' field, and based on my assessment the word 'a' appears in the position that names are extracted from. The value 'a' is treated the same as 'None' and is not considered in this report.

![Top_dog_names.png](attachment:Top_dog_names.png)
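
A rough sketch of this counting step is shown below, assuming the names live in a column called `name`. The names and counts are invented for illustration; in the report the equivalent would be the top 11 counts reduced to nine.

```python
import pandas as pd

# Hypothetical name column, including the 'None' and 'a' placeholders.
names = pd.Series(
    ["None", "Charlie", "a", "Lucy", "Charlie", "None",
     "Cooper", "Lucy", "Charlie", "a"],
    name="name",
)

placeholders = ["None", "a"]  # values that are not real dog names
top_n = 3                     # the report uses nine; three suits the toy data

counts = names.value_counts()

# Take extra rows to make room for the placeholders, then drop them.
top_names = counts.head(top_n + len(placeholders)).drop(
    labels=placeholders, errors="ignore"
)
print(top_names)
```

An alternative is to filter out the placeholders before counting and then take `head(top_n)` directly, which does not rely on the placeholders being among the top counts.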

#### 3. The source from which the highest number of tweets were sent
There were three main sources used to tweet, the most popular being 'Twitter for iPhone'.

![Tweet_sources.png](attachment:Tweet_sources.png)
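
Counting tweets per source could look something like the sketch below. The data is made up, and apart from 'Twitter for iPhone' the source labels are placeholders chosen for the example.

```python
import pandas as pd

# Hypothetical source column; only 'Twitter for iPhone' comes from the report.
sources = pd.Series([
    "Twitter for iPhone", "Source B", "Twitter for iPhone",
    "Source C", "Twitter for iPhone",
])

# Number of tweets per source, most common first.
source_counts = sources.value_counts()

# Label of the source with the highest tweet count.
most_popular = source_counts.idxmax()
print(source_counts)
print(most_popular)
```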

#### 4. Dogs with the most retweets
The maximum number of retweets was calculated by applying the pandas max() function to the retweet column for every dog name listed in the dataset.
The highest retweet counts belonged to dogs with no name (a value of None in the cell). Since None means the name was not provided, and there was no way of telling which names were missing, these rows were excluded to avoid skewing the analysis.


![Top_retweets.png](attachment:Top_retweets.png)
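
One way to express this step is sketched below: filter out the 'None' rows, then take the per-name maximum. The column names `name` and `retweet_count`, the dog names and the counts are all assumptions for the example.

```python
import pandas as pd

# Toy data; names and retweet counts are hypothetical.
df = pd.DataFrame({
    "name": ["None", "Bo", "Stephan", "None", "Bo"],
    "retweet_count": [5000, 1200, 3000, 800, 2500],
})

# Exclude rows where no name was extracted.
named = df[df["name"] != "None"]

# Maximum retweet count per dog name, largest first, at most ten names.
top_retweets = (
    named.groupby("name")["retweet_count"]
         .max()
         .sort_values(ascending=False)
         .head(10)
)
print(top_retweets)
```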

#### 5. Top ten highest rated dogs
The rating for each dog was calculated from two columns in the dataset: the rating numerator and the rating denominator. Although ratings are generally given on a scale of 0 to 10, the numerator and denominator differ widely in some cases, so dividing the numerator by the denominator can produce values larger than 10. Although some Twitter users questioned the ratings, this is considered acceptable. One calculated rating resulted in an infinite value (shown as 'inf' in the dataset), which is the value pandas produces when a non-zero number is divided by zero. It was treated as an outlier and not included.
![Top_rated.png](attachment:Top_rated.png)
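
The rating calculation, including the handling of the 'inf' value, can be sketched as follows. The column names `rating_numerator` and `rating_denominator` and every value in the toy data are assumptions, not figures from the dataset.

```python
import numpy as np
import pandas as pd

# Toy data; the last row has a zero denominator to reproduce the 'inf' case.
df = pd.DataFrame({
    "name": ["Bella", "Oliver", "Atticus", "Sam"],
    "rating_numerator": [13, 12, 420, 24],
    "rating_denominator": [10, 10, 10, 0],
})

# Rating as numerator divided by denominator; x / 0 becomes inf in pandas.
df["rating"] = df["rating_numerator"] / df["rating_denominator"]

# Keep only finite ratings, which drops the inf row (and any NaN).
rated = df[np.isfinite(df["rating"])]

# Ten highest rated dogs (fewer here, since the toy data is small).
top_rated = rated.nlargest(10, "rating")
print(top_rated[["name", "rating"]])
```

Filtering with `np.isfinite` makes the exclusion explicit in the code rather than relying on the value being spotted by eye.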

#### 6. Relationship between retweets and favorites
The scatter plot below shows a positive correlation between the number of retweets and the number of favorites the dogs received.
![Retweets_favorites.png](attachment:Retweets_favorites.png)
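
The visual impression from a scatter plot can be backed up with a correlation coefficient. The sketch below computes Pearson's r on toy data; the column names `retweet_count` and `favorite_count` and the values are hypothetical, and the report itself does not state a coefficient.

```python
import pandas as pd

# Toy data in which favorites grow with retweets; values are hypothetical.
df = pd.DataFrame({
    "retweet_count": [10, 50, 200, 400, 1000],
    "favorite_count": [40, 120, 600, 900, 2500],
})

# Pearson correlation coefficient (the default method for Series.corr).
r = df["retweet_count"].corr(df["favorite_count"])
print(f"Pearson r = {r:.3f}")
```

A value of r close to 1 indicates a strong positive linear relationship, and a value near 0 indicates little or none.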

#### 7. Accuracy of image prediction
An analysis was also done on data containing image predictions. The data held the results of an algorithm that ran three times, along with the confidence level of each prediction, i.e. how likely the prediction is to be right. The maximum confidence for each of the three predictions was calculated, and the results revealed that the first prediction had the highest value, 99%, whereas the highest confidence levels for the second and third predictions were 46% and 27% respectively. No plots were produced for this analysis. On this basis, the first predictions made on the images were deemed to be the most accurate.
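
A comparison of this kind could be sketched as below. The column names `p1_conf`, `p2_conf` and `p3_conf` are assumed for the example, and the confidence values are invented, so the output does not reproduce the 99%, 46% and 27% figures above.

```python
import pandas as pd

# Hypothetical confidence levels for the three predictions of three images.
preds = pd.DataFrame({
    "p1_conf": [0.95, 0.60, 0.80],
    "p2_conf": [0.03, 0.40, 0.10],
    "p3_conf": [0.01, 0.20, 0.05],
})

# Highest confidence reached by each of the three predictions.
max_conf = preds[["p1_conf", "p2_conf", "p3_conf"]].max()

# Prediction column whose maximum confidence is largest.
best = max_conf.idxmax()
print(max_conf)
print(best)
```

Note that a high confidence level describes how sure the algorithm is, which is not the same as measured accuracy; checking accuracy would require known labels for the images.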

### Conclusion:
The data wrangling task was quite extensive: it involved extracting data from several sources into three initial datasets, then assessing the data for datatype correctness and cleanliness. The analysis required a further level of effort to determine which data points would add useful insight to the report.
In summary, the analysis identified the dog breeds with the ten highest favorite counts, the top nine popular dog names, the sources from which tweets were sent, the dogs with the most retweets, and the top ten highest rated dogs. It also found a positive correlation between the number of retweets and favorites received by the dogs, and found the algorithm's first prediction to be the most accurate at predicting the images.

For the most part, this project provided the opportunity to apply many skills in data analysis and wrangling, and to use various features of pandas, NumPy and Python, as well as data visualization.

