# Project: Wrangle and Analyze the Dataset of WeRateDogs Twitter Archive

To conclude the project, this report presents several insights discovered from the dataset. Before diving into them, it is worth describing the limitations of the dataset we inspected. We should be aware of the following restrictions:

* The timespan of the dataset. The dataset covers only Tweets posted between 2015-11-15 22:32:08 and 2017-08-01 16:23:56.
* The coverage of Twitter API. We only have access to data that can be retrieved via Twitter API [Standard v1.1](https://developer.twitter.com/en/docs/twitter-api/v1), so some details, such as *quote_count* and *reply_count*, cannot be retrieved without access to [Premium v1.1](https://developer.twitter.com/en/docs/twitter-api/premium) or [Enterprise](https://developer.twitter.com/en/docs/twitter-api/enterprise) tier services.
* The way [WeRateDogs](https://weratedogs.com/) works. Submitting is really easy: just send photos of your dogs to [@dog_rates](https://twitter.com/dog_rates) by Direct Message on Twitter, and they will handle the rest. But this simplicity has its downsides, too. Since @dog_rates takes care of almost everything, from selecting photos to rating them and posting Tweets with a humorous comment (that's right, apart from the photos, the dog names, and the Direct Message itself, none of this amusing content comes from the dog owners), its audience is relatively passive in the entire dog rating system.

As we shall see in the following sections, the insights discovered are also shaped by these limitations.

In [1]:
# https://stackoverflow.com/questions/51576756/display-render-an-html-file-inside-jupyter-notebook-on-google-colab-platform
# https://stackoverflow.com/questions/25692293/inserting-a-link-to-a-webpage-in-an-ipython-notebook
# Import the submodule explicitly so that IPython.display.HTML is always available
import IPython.display

#### Insight 1 : Most Popular Utility Used by WeRateDogs to Post the Tweets

In [2]:
IPython.display.HTML(filename='vbar_utility_type_count.html')

At first glance, and with the title of the bar plot hidden, we might think this plot shows which client utilities the WeRateDogs audience prefers when posting Tweets. It does not. The plot shows only the utility preferences of the Twitter users who run the @dog_rates account. Even though it is plausible that dog owners take photos of their good dogs mainly with their smartphones, we cannot confirm this assumption, since this information is concealed by the passive role of the audience.
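The counts behind a plot like this presumably come from the `source` field of each Tweet, which the Twitter API v1.1 returns as an HTML anchor string rather than a plain name. The sketch below shows one way to pull the utility name out of that string and count it. The sample values and the helper name `utility_name` are hypothetical, not taken from the project code.

```python
import re
from collections import Counter

# Hypothetical sample of raw `source` values; real values come from the archive.
sources = [
    '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
    '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
    '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>',
]


def utility_name(source_html):
    """Return the text between the anchor tags, or the input if no tag is found."""
    match = re.search(r">([^<]+)</a>", source_html)
    return match.group(1) if match else source_html


# Count how many Tweets were posted from each utility.
utility_counts = Counter(utility_name(s) for s in sources)
print(utility_counts.most_common())
```

Whatever the exact extraction step was, the result describes the posting account only, which is the point made above.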

However, this piece of information is not altogether useless; it gives us an idea of how [Matthew Nelson](https://twitter.com/dogfather) and his team members produce the good dog Tweets every day. According to the news report [here](https://www.elitedaily.com/p/weratedogs-matt-nelson-is-responsible-for-the-captions-on-your-favorite-dog-photos-15959821), he and his team receive 800 to 1000 photo submissions per day, and they must pick out the best photos from this pile, write a comical comment, and post the Tweet to serve the audience. It may be typical to use PCs to handle such a workload, but it seems they manage the task using only their own iPhones to collaborate. It would be interesting to know how many hours a day they spend looking at iPhone screens.

#### Insight 2 : Observe if the Hyperboles of Numerator and Denominator Could Lead to Higher Counts of Retweets or Favorites

In general, rating scales are designed to be objective indicators that measure the quality of products or support performance reviews of personnel in organizations. Put casually, rating scales are mostly designed to serve commercial purposes.

WeRateDogs does not use rating scales in this way: it commonly rates dog photos received from Twitter users with numerators bigger than 10, and some photos even received three- or four-digit numerators. Below are some of the Tweets with extreme ratings, and these should give us a rough idea of the way it uses its rating scales.

In [3]:
%%html
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">This is Atticus. He&#39;s quite simply America af. 1776/10 <a href="https://t.co/GRXwMxLBkh">pic.twitter.com/GRXwMxLBkh</a></p>&mdash; WeRateDogs® (@dog_rates) <a href="https://twitter.com/dog_rates/status/749981277374128128?ref_src=twsrc%5Etfw">July 4, 2016</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script> 

This one shouldn't be hard. Almost everyone on earth knows what happened on [July 4, 1776](https://en.wikipedia.org/wiki/United_States_Declaration_of_Independence).

In [4]:
%%html
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">After so many requests... here you go.<br><br>Good dogg. 420/10 <a href="https://t.co/yfAAo1gdeY">pic.twitter.com/yfAAo1gdeY</a></p>&mdash; WeRateDogs® (@dog_rates) <a href="https://twitter.com/dog_rates/status/670842764863651840?ref_src=twsrc%5Etfw">November 29, 2015</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script> 

It's a [Dogg](https://www.youtube.com/watch?v=ek1G1hVliMI), not a dog.

In [5]:
%%html
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">When you&#39;re so blinded by your systematic plagiarism that you forget what day it is. 0/10 <a href="https://t.co/YbEJPkg4Ag">pic.twitter.com/YbEJPkg4Ag</a></p>&mdash; WeRateDogs® (@dog_rates) <a href="https://twitter.com/dog_rates/status/835152434251116546?ref_src=twsrc%5Etfw">February 24, 2017</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script> 

This is the only Tweet that received a 0/10 rating in the dataset. We all know it is a disgrace to [steal other people's work](https://en.wikipedia.org/wiki/Plagiarism).

Frankly, such data would be discarded immediately under an ordinary rating scale, but we are keeping all of them because we know [they're good dogs, Brent.](http://knowyourmeme.com/memes/theyre-good-dogs-brent) After all, Matt started running @dog_rates simply hoping to [make people laugh](https://weratedogs.com/pages/about-weratedogs), so their rating scale should be read in the same spirit. To WeRateDogs, ratings are not about quality or performance; they are about attracting the audience and making them feel happy.

And because Matt and his team do their work through Twitter, their peculiar rating scale is a potential factor in how other Twitter users respond to their Tweets. In this insight we will check whether there is any obvious relation between their ratings and the retweets and favorites they gathered.

First we will check the frequency of each numerator and denominator value.

In [6]:
IPython.display.HTML(filename='hbar_abs_freq_rating_denominator.html')

In [7]:
IPython.display.HTML(filename='hbar_abs_freq_rating_numerator.html')

We can see most of the denominator values are 10, and most of the numerator values fall in the range of 10 ~ 13. Next we will check the relative frequency, in percent, of both denominators and numerators.

In [8]:
IPython.display.HTML(filename='hbar_rel_freq_rating_denominator.html')

In [9]:
IPython.display.HTML(filename='hbar_rel_freq_rating_numerator.html')

We can see the following facts from the plots above:
* For numerators, values between 10 ~ 13 comprise 76% of the whole
* For denominators, the value 10 alone comprises more than 99% of the whole
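Absolute and relative frequencies of this kind can be obtained with pandas `value_counts`. The following is a minimal sketch on hypothetical toy data; the column name `rating_numerator` is assumed from the plot file names, and the printed share applies to the toy values only, not to the real dataset.

```python
import pandas as pd

# Hypothetical toy data; the real column holds about two thousand ratings.
ratings = pd.DataFrame({"rating_numerator": [12, 10, 13, 11, 12, 420, 5, 12]})

# Absolute frequency of each value, and the same expressed in percent.
abs_freq = ratings["rating_numerator"].value_counts()
rel_freq = ratings["rating_numerator"].value_counts(normalize=True) * 100

# Share of numerators that fall in the common 10 ~ 13 band (bounds inclusive).
in_band = ratings["rating_numerator"].between(10, 13)
band_share = in_band.mean() * 100
print(f"{band_share:.1f}% of the toy numerators lie in 10 ~ 13")
```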

Since a narrow range of values accounts for more than three-quarters of each column, and the hyperbolic ratings belong to the remaining fraction, we will split each column into these two subsets and visualize how each relates to the counts of retweets and favorites. Jitter is added to the scatter plots to reduce the overlap between data points.

(Note: you can toggle between different subsets of data by clicking on the legends of the scatter plots.)
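Jitter means shifting each point by a small random amount so that Tweets with the same rating do not sit exactly on top of each other. The report does not say how the jitter was generated, so the sketch below is one common approach (uniform noise added with NumPy); the function name `add_jitter` and the `width` parameter are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the plot is reproducible


def add_jitter(values, width=0.4):
    """Shift each value by uniform noise drawn from [-width/2, +width/2)."""
    values = np.asarray(values, dtype=float)
    return values + rng.uniform(-width / 2, width / 2, size=values.shape)


# Six Tweets sharing only three distinct numerators would overlap without jitter.
numerators = [10, 10, 10, 12, 12, 13]
jittered = add_jitter(numerators)
print(jittered)
```

The jittered values are meant for plotting only; any counts or correlations should still be computed from the original ratings.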

In [10]:
IPython.display.HTML(filename='scatter_rating_denominator_vs_retweet_count.html')

In [11]:
IPython.display.HTML(filename='scatter_rating_denominator_vs_favorite_count.html')

From the plots we can see that rating denominators are barely related to retweet counts and favorite counts. For denominators *not* equal to 10, the relationship is even weaker. Among Tweets whose rating denominator was not 10, the highest retweet count was 3,716 and the highest favorite count was 13,518.
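Maxima for a subset like "denominator not equal to 10" can be read off with a boolean mask. A minimal sketch on hypothetical toy rows follows; the column names are assumed to match the cleaned archive, and the numbers are made up rather than taken from the dataset.

```python
import pandas as pd

# Hypothetical toy rows, not real Tweets.
tweets = pd.DataFrame({
    "rating_denominator": [10, 10, 50, 80, 10],
    "retweet_count": [9000, 120, 3000, 800, 40000],
    "favorite_count": [20000, 500, 11000, 2500, 90000],
})

# Keep only Tweets whose denominator differs from the usual 10.
unusual = tweets[tweets["rating_denominator"] != 10]

# Column-wise maxima within that subset.
max_counts = unusual[["retweet_count", "favorite_count"]].max()
print(max_counts)
```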

In [12]:
IPython.display.HTML(filename='scatter_rating_numerator_vs_retweet_count.html')

In [13]:
IPython.display.HTML(filename='scatter_rating_numerator_vs_favorite_count.html')

The rating numerators show a slightly stronger relationship with retweet counts and favorite counts, although the correlations remain weak. Among Tweets whose rating numerator fell *outside* the range of 10 ~ 13, the highest retweet count was 42,228 and the highest favorite count was 95,450.

Overall there is no strong correlation between ratings and the counts of retweets and favorites. This is a plausible result, since WeRateDogs apparently treats its ratings as just another string in the Tweet text, one that may or may not attract the audience, and no more. From these results, we can expect WeRateDogs to rely on something other than the ratings to make its Tweets of good dogs more attractive.
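A statement about correlation strength can be backed by a coefficient. The report does not say which coefficient, if any, was computed, so the sketch below is only an illustration on hypothetical toy data. It computes Pearson's r with pandas, and Spearman's rho as Pearson's r on ranks, which is less sensitive to hyperbolic ratings such as 420/10.

```python
import pandas as pd

# Hypothetical toy data with one hyperbolic rating (420).
tweets = pd.DataFrame({
    "rating_numerator": [10, 11, 12, 13, 12, 420, 5],
    "retweet_count": [500, 900, 2500, 7000, 1800, 3000, 300],
})

# Pearson's r measures linear association and is pulled around by the outlier.
pearson = tweets["rating_numerator"].corr(tweets["retweet_count"])

# Spearman's rho is Pearson's r computed on the ranks of both columns.
spearman = tweets["rating_numerator"].rank().corr(tweets["retweet_count"].rank())

print(f"Pearson r = {pearson:.2f}, Spearman rho = {spearman:.2f}")
```

On data with a few extreme numerators, the two coefficients can differ a lot, so it is worth reporting which one a "weak correlation" refers to.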

#### Insight 3 : Observe if the Different Dog Stages Could Lead to Higher Counts of Retweets or Favorites

After looking at the relationship between numerator/denominator and the counts of retweets and favorites, it is time to inspect whether the so-called dog stages, a new vocabulary invented by the author of [Dogtionary](https://www.amazon.com/Dogtionary-Meaningful-Portraits-Sharon-Montrose/dp/B00030KOOY), have any impact on the counts of retweets and favorites.

According to the dataset, WeRateDogs began using these dog stage words in its Tweets on 2015-12-02, and only about 300 Tweets in the dataset carry a dog stage, which makes these Tweets a relatively special group.
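Dog stages have to be detected in the Tweet text before Tweets can be grouped by them. The sketch below shows one possible way to do this with a regular expression. The four stage words are taken from the tables in this report, while the function name `find_stages` and the matching rules (whole words, case-insensitive) are assumptions rather than the project's actual method.

```python
import re

# Whole-word, case-insensitive match for the four dog stage words.
STAGE_PATTERN = re.compile(r"\b(doggo|floofer|pupper|puppo)\b", re.IGNORECASE)


def find_stages(text):
    """Return the distinct dog stage words in a Tweet text, in order of first appearance."""
    found = []
    for word in STAGE_PATTERN.findall(text):
        word = word.lower()
        if word not in found:
            found.append(word)
    return found


print(find_stages("Here's a doggo realizing you can stand in a pool. 13/10"))
print(find_stages("A doggo and a Pupper sharing one stick"))
```

Returning a list rather than a single word keeps Tweets that mention more than one stage intact.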

In [14]:
IPython.display.HTML(filename='scatter_rating_denominator_vs_retweet_count_good_dog.html')

In [15]:
IPython.display.HTML(filename='scatter_rating_denominator_vs_favorite_count_good_dog.html')

We can already see that some of the Tweets in this group have the highest counts of retweets and favorites in the entire dataset.

In [16]:
IPython.display.HTML(filename='scatter_rating_numerator_vs_retweet_count_good_dog.html')

In [17]:
IPython.display.HTML(filename='scatter_rating_numerator_vs_favorite_count_good_dog.html')

The scatter plots above show that the data points become sparse once retweet_count reaches 20,000 or favorite_count reaches 40,000, which suggests these values may serve as a dividing line between the good Tweets and the best Tweets. Next we will count the Tweets above this line and calculate their percentage among Tweets with and without dog stages in the comment text.

||Total No. of Tweets|No. of Tweets with retweet_count > 20000|Pct. Proportion of Tweets with retweet_count > 20000|
|---|---:|---:|---:|
|With Dog Stages|303|7|2.31%|
|Without Dog Stages|1668|13|0.78%|

||Total No. of Tweets|No. of Tweets with favorite_count > 40000|Pct. Proportion of Tweets with favorite_count > 40000|
|---|---:|---:|---:|
|With Dog Stages|303|13|4.29%|
|Without Dog Stages|1668|33|1.98%|
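Tables like the two above boil down to a count and a share per group. A minimal sketch with pandas, on hypothetical toy rows (the `group` labels and column names are made up for illustration):

```python
import pandas as pd

RETWEET_THRESHOLD = 20_000  # dividing line read off the scatter plots

# Hypothetical toy rows: four Tweets with a dog stage, six without.
tweets = pd.DataFrame({
    "group": ["with stage"] * 4 + ["without stage"] * 6,
    "retweet_count": [25000, 1200, 300, 8000, 21000, 500, 900, 1500, 40, 7000],
})

# Flag Tweets above the threshold, then count and sum the flags per group.
summary = (
    tweets.assign(above=tweets["retweet_count"] > RETWEET_THRESHOLD)
    .groupby("group")["above"]
    .agg(total="count", n_above="sum")
)
summary["pct_above"] = (summary["n_above"] / summary["total"] * 100).round(2)
print(summary)
```

The same arithmetic reproduces the retweet table above: 7 of 303 is 2.31%, and 13 of 1,668 is 0.78%.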

We can also calculate the same metrics for each dog stage. Note that some Tweets have more than one dog stage assigned to them.

|Dog Stage Name|No. of Tweets in the Dataset|No. of Tweets with retweet_count > 20000|Pct. Proportion of Tweets with retweet_count > 20000|
|---|---:|---:|---:|
|doggo|73|5|6.85%|
|floofer|8|0|0.00%|
|pupper|209|1|0.48%|
|puppo|23|1|4.35%|

|Dog Stage Name|No. of Tweets in the Dataset|No. of Tweets with favorite_count > 40000|Pct. Proportion of Tweets with favorite_count > 40000|
|---|---:|---:|---:|
|doggo|73|9|12.33%|
|floofer|8|0|0.00%|
|pupper|209|3|1.44%|
|puppo|23|3|13.04%|
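Because a Tweet can carry more than one dog stage, per-stage tables count such a Tweet once for each stage, which is consistent with the per-stage totals (73 + 8 + 209 + 23 = 313) exceeding the 303 Tweets with dog stages. One way to get per-stage figures is to expand the list of stages into one row per stage with pandas `explode`. This is a sketch on hypothetical toy rows, and the `stages` column of lists is an assumed layout, not the project's.

```python
import pandas as pd

# Hypothetical toy rows; Tweet 2 mentions two stages.
tweets = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "stages": [["doggo"], ["doggo", "pupper"], ["puppo"], ["pupper"]],
    "favorite_count": [50000, 12000, 45000, 900],
})

# One row per (Tweet, stage) pair, so multi-stage Tweets appear under each stage.
per_stage = tweets.explode("stages").rename(columns={"stages": "stage"})

stage_summary = per_stage.groupby("stage")["favorite_count"].agg(
    n_tweets="count", max_favorites="max"
)
print(stage_summary)
```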

It seems that, when used properly, dog stages can increase the odds of producing a far-reaching dog Tweet; at the very least, they look more promising than relying on [Brent's good dogs rating scales](http://knowyourmeme.com/memes/theyre-good-dogs-brent).

And, as noted before, Tweets with dog stages include some of those with the highest counts of retweets and favorites in the dataset. The following is a brief summary of the best Tweets of each dog stage and their maximum counts of retweets and favorites.

|Dog Stage Name|No. of Tweets in the Dataset|Max. Retweet Count of a Tweet|Max. Favorite Count of a Tweet|
|---|---:|---:|---:|
|doggo|73|79,515|131,075|
|floofer|8|18,497|33,345|
|pupper|209|32,883|106,827|
|puppo|23|48,265|132,810|

And these are the best Tweets which attained the maximum counts listed above.

In [18]:
%%html
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">Here&#39;s a doggo realizing you can stand in a pool. 13/10 enlightened af (vid by Tina Conrad) <a href="https://t.co/7wE9LTEXC4">pic.twitter.com/7wE9LTEXC4</a></p>&mdash; WeRateDogs® (@dog_rates) <a href="https://twitter.com/dog_rates/status/744234799360020481?ref_src=twsrc%5Etfw">June 18, 2016</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script> 

In [19]:
%%html
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">Atlas rolled around in some chalk and now he&#39;s a magical rainbow floofer. 13/10 please never take a bath <a href="https://t.co/nzqTNw0744">pic.twitter.com/nzqTNw0744</a></p>&mdash; WeRateDogs® (@dog_rates) <a href="https://twitter.com/dog_rates/status/776218204058357768?ref_src=twsrc%5Etfw">September 15, 2016</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script> 

In [20]:
%%html
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">This is Jamesy. He gives a kiss to every other pupper he sees on his walk. 13/10 such passion, much tender <a href="https://t.co/wk7TfysWHr">pic.twitter.com/wk7TfysWHr</a></p>&mdash; WeRateDogs® (@dog_rates) <a href="https://twitter.com/dog_rates/status/866450705531457537?ref_src=twsrc%5Etfw">May 22, 2017</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script> 

In [21]:
%%html
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">Here&#39;s a super supportive puppo participating in the Toronto <a href="https://twitter.com/hashtag/WomensMarch?src=hash&amp;ref_src=twsrc%5Etfw">#WomensMarch</a> today. 13/10 <a href="https://t.co/nTz3FtorBc">pic.twitter.com/nTz3FtorBc</a></p>&mdash; WeRateDogs® (@dog_rates) <a href="https://twitter.com/dog_rates/status/822872901745569793?ref_src=twsrc%5Etfw">January 21, 2017</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script> 

And here come two more clues: the *doggo* Tweet used a *video* clip instead of an ordinary photo, and the *puppo* Tweet placed a *hashtag* in the text. These lead us to the next insight in this report.

#### Insight 4 : Observe if Extra Contents Like Hashtags and Videos Could Lead to Higher Counts of Retweets or Favorites

In contrast to the insights we examined before, which matter only to those running @dog_rates, this insight is user-actionable (perhaps we should say audience-actionable). You are free to submit a dog video instead of a dog photo, and in most countries it is okay to bring your dog to most public events. But we still need to know whether these things are worth our time.
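Grouping Tweets by extra content first requires a flag for each Tweet. In a Twitter API v1.1 Tweet object, hashtags are listed under `entities.hashtags`, and attached media under `extended_entities.media`, where each item has a `type` of `photo`, `video`, or `animated_gif`. The sketch below classifies a Tweet from those fields. The function name `extra_content` and the sample Tweets are hypothetical, and this is not necessarily how the project derived its groups.

```python
def extra_content(tweet):
    """Return the kinds of extra content found in a v1.1 Tweet object given as a dict."""
    kinds = set()
    # A non-empty hashtag list means the text contains at least one hashtag.
    if tweet.get("entities", {}).get("hashtags"):
        kinds.add("hashtag")
    # Only videos and animated GIFs count as extra; plain photos are the default.
    for media in tweet.get("extended_entities", {}).get("media", []):
        if media.get("type") in ("video", "animated_gif"):
            kinds.add(media["type"])
    return kinds


# Hypothetical, heavily trimmed Tweet objects.
video_tweet = {"extended_entities": {"media": [{"type": "video"}]}}
hashtag_tweet = {
    "entities": {"hashtags": [{"text": "WomensMarch"}]},
    "extended_entities": {"media": [{"type": "photo"}]},
}
print(extra_content(video_tweet), extra_content(hashtag_tweet))
```

A Tweet for which the function returns an empty set would fall into a "none of the above" group.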

In [22]:
IPython.display.HTML(filename='scatter_rating_denominator_vs_retweet_count_extra.html')

In [23]:
IPython.display.HTML(filename='scatter_rating_denominator_vs_favorite_count_extra.html')

In [24]:
IPython.display.HTML(filename='scatter_rating_numerator_vs_retweet_count_extra.html')

In [25]:
IPython.display.HTML(filename='scatter_rating_numerator_vs_favorite_count_extra.html')

Then we summarize the plots in the following tables.

||Total No. of Tweets|No. of Tweets with retweet_count > 20000|Pct. Proportion of Tweets with retweet_count > 20000|
|---|---:|---:|---:|
|With Video|72|9|12.50%|
|With Animated GIF|3|1|33.33%|
|With Hashtag|22|1|4.55%|
|None of the Above|1876|9|0.48%|

||Total No. of Tweets|No. of Tweets with favorite_count > 40000|Pct. Proportion of Tweets with favorite_count > 40000|
|---|---:|---:|---:|
|With Video|72|11|15.28%|
|With Animated GIF|3|0|0.00%|
|With Hashtag|22|3|13.64%|
|None of the Above|1876|32|1.71%|

The results revealed some interesting facts:

* Tweets with video clips rock, and making one is a feasible way for the audience to promote their good dogs. But it takes a significant amount of time and effort to make a clip. Plus, your dogs may not be as cooperative as you expect.
* The animated GIF appears to be nearly extinct in the world of dog Tweets, with only 3 in the dataset. You may want to join the ranks of GIF advocates, but it is better to choose whichever recording format you find convenient and easy, so that you have more energy left to think about how to show off your good dogs.
* Hashtags may not look as effective as video clips, but they are still a viable option. Just bring your good dog to some fascinating events, and share how it becomes a part of the party.

## Conclusions

In this project we wrangled and analyzed the WeRateDogs Twitter archive and found several insights that give us a better understanding of how WeRateDogs works. We know Matt and his team have to work hard and persistently, using all possible means, to produce humorous Tweets that entertain dog lovers. After analyzing the rating data, we confirmed that [the rating system really sucks](http://knowyourmeme.com/memes/theyre-good-dogs-brent) as a measurement, and realized they are not taking it seriously at all. And finally, we know what to do if we ever feel obliged to share the cuteness of our own dogs with other fans around the world.