# Project: Wrangle and Analyze Data

## Introduction

Real-world data rarely comes clean. Using Python and its libraries, you will gather data from a variety of sources and in a variety of formats, assess its quality and tidiness, then clean it. This is called data wrangling. You will document your wrangling efforts in a Jupyter Notebook, plus showcase them through analyses and visualizations using Python (and its libraries) and/or SQL.

The dataset that you will be wrangling (and analyzing and visualizing) is the tweet archive of Twitter user [@dog_rates](https://twitter.com/dog_rates), also known as [WeRateDogs](https://en.wikipedia.org/wiki/WeRateDogs). WeRateDogs is a Twitter account that rates people's dogs with a humorous comment about the dog. These ratings almost always have a denominator of 10. The numerators, though? Almost always greater than 10. 11/10, 12/10, 13/10, etc. Why? Because "[they're good dogs Brent](http://knowyourmeme.com/memes/theyre-good-dogs-brent)." WeRateDogs has over 4 million followers and has received international media coverage.

WeRateDogs [downloaded their Twitter archive](https://support.twitter.com/articles/20170160) and sent it to Udacity via email exclusively for you to use in this project. This archive contains basic tweet data (tweet ID, timestamp, text, etc.) for all 5000+ of their tweets as they stood on August 1, 2017. More on this soon.

<img src = "https://video.udacity-data.com/topher/2017/October/59dd378f_dog-rates-social/dog-rates-social.jpg" height =200px, width = 400px>
<center>Image via <a href = "http://www.bostonmagazine.com/arts-entertainment/blog/2017/04/18/dog-rates-mit/">Boston Magazine</a></center>



## What Software Do I Need?

The entirety of this project can be completed inside the Udacity classroom on the **Project Workspace: Complete and Submit Project** page using the Jupyter Notebook provided there. (Note: This Project Workspace may not be available in all versions of this project, in which case you should follow the directions below.)

If you want to work outside of the Udacity classroom, the following software requirements apply:

* You need to be able to work in a Jupyter Notebook on your computer. Please revisit our Jupyter Notebook and Anaconda tutorials earlier in the Nanodegree program for installation instructions.
* The following packages (libraries) need to be installed. You can install these packages via conda or pip. Please revisit our Anaconda tutorial earlier in the Nanodegree program for package installation instructions.
  * pandas
  * NumPy
  * requests
  * tweepy
  * json (part of the Python standard library, so no separate installation is needed)
* You need to be able to create written documents that contain images and you need to be able to export these documents as PDF files. This task can be done in a Jupyter Notebook, but you might prefer to use a word processor like [Google Docs](https://www.google.com/docs/about/), which is free, or Microsoft Word.
* A text editor, like [Sublime](https://www.sublimetext.com/), which is free, will be useful but is not required.

## Project Motivation

### Context

Your goal: wrangle WeRateDogs Twitter data to create interesting and trustworthy analyses and visualizations. The Twitter archive is great, but it only contains very basic tweet information. Additional gathering, then assessing and cleaning is required for "Wow!"-worthy analyses and visualizations.

### The Data

### Enhanced Twitter Archive

The WeRateDogs Twitter archive contains basic tweet data for all 5000+ of their tweets, but not everything. One column the archive does contain though: each tweet's text, which I used to extract rating, dog name, and dog "stage" (i.e. doggo, floofer, pupper, and puppo) to make this Twitter archive "enhanced." Of the 5000+ tweets, I have filtered for tweets with ratings only (there are 2356).

<img src = "https://video.udacity-data.com/topher/2017/October/59dd4791_screenshot-2017-10-10-18.19.36/screenshot-2017-10-10-18.19.36.png" height = 400px, width= 800px>

<center><em><strong>The extracted data from each tweet's text</strong></em></center>

I extracted this data programmatically, but I didn't do a very good job. The ratings probably aren't all correct. Same goes for the dog names and probably dog stages (see below for more information on these) too. You'll need to assess and clean these columns if you want to use them for analysis and visualization.

<img src="https://video.udacity-data.com/topher/2017/October/59e04ceb_dogtionary-combined/dogtionary-combined.png" height = 400px, width=900px>

<center><em><strong>The Dogtionary explains the various stages of dog: doggo, pupper, puppo, and floof(er) (via the <a href = "https://www.amazon.com/WeRateDogs-Most-Hilarious-Adorable-Youve/dp/1510717145">#WeRateDogs</a> book on Amazon)</strong></em></center>

### Additional Data via the Twitter API

Back to the basic-ness of Twitter archives: retweet count and favorite count are two of the notable column omissions. Fortunately, this additional data can be gathered by anyone from Twitter's API. Well, "anyone" who has access to data for the 3000 most recent tweets, at least. But you, because you have the WeRateDogs Twitter archive and specifically the tweet IDs within it, can gather this data for all 5000+. And guess what? You're going to query Twitter's API to gather this valuable data.

### Image Predictions File

One more cool thing: I ran every image in the WeRateDogs Twitter archive through a [neural network](https://www.youtube.com/watch?v=2-Ol7ZB0MmU) that can classify breeds of dogs*. The results: a table full of image predictions (the top three only) alongside each tweet ID, image URL, and the image number that corresponded to the most confident prediction (numbered 1 to 4 since tweets can have up to four images).

<img src="https://video.udacity-data.com/topher/2017/October/59dd4d2c_screenshot-2017-10-10-18.43.41/screenshot-2017-10-10-18.43.41.png" width = 900px, height=400px>
<center><em><strong>Tweet image prediction data</strong></em></center>

So for the last row in that table:

* tweet_id is the last part of the tweet URL after "status/" → https://twitter.com/dog_rates/status/889531135344209921
* p1 is the algorithm's #1 prediction for the image in the tweet → <strong>golden retriever</strong>
* p1_conf is how confident the algorithm is in its #1 prediction → <strong>95%</strong>
* p1_dog is whether or not the #1 prediction is a breed of dog → <strong>TRUE</strong>
* p2 is the algorithm's second most likely prediction → <strong>Labrador retriever</strong>
* p2_conf is how confident the algorithm is in its #2 prediction → <strong>1%</strong>
* p2_dog is whether or not the #2 prediction is a breed of dog → <strong>TRUE</strong>
* etc.

And the #1 prediction for the image in that tweet was spot on:

<img src="https://video.udacity-data.com/topher/2017/October/59dd4e05_dog-pred/dog-pred.png" width = 250px, height=200px>
<center><strong>A golden retriever named Stuart</strong></center>

So that's all fun and good. But all of this additional data will need to be gathered, assessed, and cleaned. This is where you come in.

### Key Points

Key points to keep in mind when data wrangling for this project:
<ul>
    <li>You only want original ratings (no retweets) that have images. Though there are 5000+ tweets in the dataset, not all are dog ratings and some are retweets.</li>
    <li>Assessing and cleaning the entire dataset completely would require a lot of time, and is not necessary to practice and demonstrate your skills in data wrangling. Therefore, the requirements of this project are only to assess and clean at least 8 quality issues and at least 2 tidiness issues in this dataset.</li>
    <li>Cleaning includes merging individual pieces of data according to the rules of <a href="https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html">tidy data</a>.</li>
    <li>The fact that the rating numerators are greater than the denominators does not need to be cleaned. This <a href="http://knowyourmeme.com/memes/theyre-good-dogs-brent">unique rating system</a> is a big part of the popularity of WeRateDogs.</li>
    <li>You do not need to gather the tweets beyond August 1st, 2017. You can, but note that you won't be able to gather the image predictions for these tweets since you don't have access to the algorithm used.</li>
</ul>
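
The first key point (original ratings only, with images) can be sketched as a pair of boolean filters. This is a minimal illustration on made-up rows, not the project's required solution; it assumes retweets are the rows with a non-null `retweeted_status_id` and that a tweet "has an image" when its `tweet_id` appears in the image predictions table.

```python
import pandas as pd

# Toy stand-ins for the archive and the image predictions (values are made up)
archive = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "retweeted_status_id": [None, 99.0, None, None],
})
predictions = pd.DataFrame({"tweet_id": [1, 2, 4]})

# Assumption: a non-null retweeted_status_id marks a retweet
is_original = archive["retweeted_status_id"].isna()
# Assumption: a tweet has an image if it appears in the predictions table
has_image = archive["tweet_id"].isin(predictions["tweet_id"])

originals_with_images = archive[is_original & has_image]
print(originals_with_images["tweet_id"].tolist())
```

Tweet 2 is dropped because it is a retweet, and tweet 3 is dropped because it has no image prediction.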

### Project Details

Your tasks in this project are as follows:
<ul>
    <li>Data wrangling, which consists of:
        <ul>
          <li>Gathering data (downloadable file in the Resources tab in the leftmost panel of your classroom and linked in step 1 below)</li>
          <li>Assessing data</li>
          <li>Cleaning data</li>
        </ul>
    </li>
    <li>Storing, analyzing, and visualizing your wrangled data</li>
    <li>Reporting on 1) your data wrangling efforts and 2) your data analyses and visualizations</li>
</ul>

In [1]:
# Import statements
import pandas as pd
import numpy as np
import requests
import tweepy
import os
import json
import time
import re
import matplotlib.pyplot as plt
import warnings
from IPython.display import Image
from functools import reduce
import seaborn as sns
import datetime
# jupyterthemes is optional; it only restyles the plots
from jupyterthemes import jtplot
jtplot.style(theme='onedork')

%matplotlib inline

ModuleNotFoundError: No module named 'jupyterthemes'

In [None]:
!pip install jupyterthemes==0.16.1

# Gather

In [None]:
!pip install tweepy

In [None]:
# Open the csv file
df_twitter_archive = pd.read_csv('twitter-archive-enhanced-2.csv')
df_twitter_archive.head()

**Tweet image prediction**

In [None]:
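# Download the image prediction file using the link provided to Udacity students
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
image_request = requests.get(url, allow_redirects=True)
# Stop here if the download failed instead of saving an error page
image_request.raise_for_status()

# Save the downloaded bytes and close the file afterwards
with open('image_predictions.tsv', 'wb') as file:
    file.write(image_request.content)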

In [None]:
# Showing the data in the image predictions file
df_image_predictions = pd.read_csv('image_predictions.tsv', sep = '\t')
df_image_predictions.head()

**Ref: https://stackoverflow.com/questions/28384588/twitter-api-get-tweets-with-specific-id**

In [None]:
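# Read the API credentials from environment variables instead of hard-coding them.
# The variable names below are placeholders; use whichever names you have set.
auth = tweepy.OAuthHandler(os.environ['TWITTER_CONSUMER_KEY'], os.environ['TWITTER_CONSUMER_SECRET'])
auth.set_access_token(os.environ['TWITTER_ACCESS_TOKEN'], os.environ['TWITTER_ACCESS_SECRET'])
# Note: wait_on_rate_limit_notify is a Tweepy 3.x argument; it was removed in Tweepy 4
api = tweepy.API(auth, 
                 parser = tweepy.parsers.JSONParser(), 
                 wait_on_rate_limit = True, 
                 wait_on_rate_limit_notify = True)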

**Twitter API & JSON**

In [None]:
#Download Tweepy status object based on Tweet ID and store in list
list_of_tweets = []
# Tweets that can't be found are saved in the list below:
cant_find_tweets_for_those_ids = []
for tweet_id in df_twitter_archive['tweet_id']:   
    try:
        list_of_tweets.append(api.get_status(tweet_id))
    except Exception as e:
        cant_find_tweets_for_those_ids.append(tweet_id)

In [None]:
# Print how many tweets were retrieved and how many could not be found
print("Tweets retrieved:", len(list_of_tweets))
print("Tweets not found:", len(cant_find_tweets_for_those_ids))

In [None]:
# Because the API object was created with JSONParser, each downloaded tweet
# is already a plain dictionary, so we simply collect them all into a list
my_list_of_dicts = list(list_of_tweets)

In [None]:
# Write this list into a txt file
with open('tweet_json.txt', 'w', encoding='utf-8') as file:
    file.write(json.dumps(my_list_of_dicts, indent=4))

In [None]:
# Identify the information of interest in the JSON dictionaries stored in the
# txt file and put it in a dataframe called tweet_json
my_demo_list = []
with open('tweet_json.txt', encoding='utf-8') as json_file:
    all_data = json.load(json_file)

for each_dictionary in all_data:
    # The tweet's URL is whatever follows the first 'https' in the text
    whole_tweet = each_dictionary['text']
    only_url = whole_tweet[whole_tweet.find('https'):]
    # The source is an HTML anchor; keep only the device name between the tags
    whole_source = each_dictionary['source']
    source = whole_source[whole_source.find('rel="nofollow">') + 15:-4]
    # Retweets carry a 'retweeted_status' key; original tweets do not
    if 'retweeted_status' in each_dictionary:
        retweeted_status = 'This is a retweet'
        url = 'This is a retweet'
    else:
        retweeted_status = 'Original tweet'
        url = only_url

    my_demo_list.append({'tweet_id': str(each_dictionary['id']),
                         'favorite_count': int(each_dictionary['favorite_count']),
                         'retweet_count': int(each_dictionary['retweet_count']),
                         'followers_count': int(each_dictionary['user']['followers_count']),
                         'friends_count': int(each_dictionary['user']['friends_count']),
                         'url': url,
                         'source': source,
                         'retweeted_status': retweeted_status,
                        })

# Build the dataframe once, after all tweets have been processed
tweet_json = pd.DataFrame(my_demo_list, columns = ['tweet_id', 'favorite_count','retweet_count', 
                                                   'followers_count', 'friends_count','source', 
                                                   'retweeted_status', 'url'])

In [None]:
tweet_json.info()
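
The field extraction above can also be expressed as a small function applied to each tweet dictionary. The sketch below is a minimal illustration using only the standard library. The sample tweet is made up, the helper name `summarize` is hypothetical, and it assumes the `source` field is an HTML anchor whose link text is the device name.

```python
import json
import re

# A made-up tweet in the general shape returned by the API (illustration only)
raw = json.dumps([{
    "id": 1,
    "text": "Meet Stuart. 13/10 https://t.co/abc",
    "favorite_count": 5,
    "retweet_count": 2,
    "source": '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
}])

def summarize(tweet):
    """Pick out the fields of interest from one tweet dictionary (hypothetical helper)."""
    # Keep only the link text of the source anchor, if the pattern matches
    match = re.search(r">([^<]*)</a>", tweet["source"])
    return {
        "tweet_id": str(tweet["id"]),
        "favorite_count": tweet["favorite_count"],
        "retweet_count": tweet["retweet_count"],
        "source": match.group(1) if match else tweet["source"],
        # Assumption: retweets are the tweets that carry a 'retweeted_status' key
        "is_retweet": "retweeted_status" in tweet,
    }

rows = [summarize(tweet) for tweet in json.loads(raw)]
print(rows[0]["source"])
```

Using a regular expression for the source avoids depending on fixed character offsets, at the cost of assuming the anchor format.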

## Assessing data

* (**Visual assessment**)  Each piece of gathered data is displayed in the Jupyter Notebook for visual assessment purposes.

In [None]:
df_twitter_archive

In [None]:
df_image_predictions

In [None]:
tweet_json

* (**Programmatic assessment**) Pandas' functions and/or methods are used to assess the data.

In [None]:
df_twitter_archive.info()

In [None]:
df_image_predictions.info()

In [None]:
tweet_json.info()

**Archive Dataframe Analysis**

In [None]:
df_twitter_archive.rating_numerator.value_counts()

In [None]:
print(df_twitter_archive.loc[df_twitter_archive.rating_numerator == 204, 'text']) 
print(df_twitter_archive.loc[df_twitter_archive.rating_numerator == 143, 'text']) 
print(df_twitter_archive.loc[df_twitter_archive.rating_numerator == 666, 'text']) 
print(df_twitter_archive.loc[df_twitter_archive.rating_numerator == 1176, 'text'])
print(df_twitter_archive.loc[df_twitter_archive.rating_numerator == 144, 'text'])

In [None]:
#print whole text in order to verify numerators and denominators
#17 dogs
print(df_twitter_archive['text'][1120]) 
#13 dogs
print(df_twitter_archive['text'][1634])
#just a tweet to explain actual ratings, this will be ignored when cleaning data
print(df_twitter_archive['text'][313]) 
#no picture, this will be ignored when cleaning data
print(df_twitter_archive['text'][189]) 
#12 dogs
print(df_twitter_archive['text'][1779]) 

In [None]:
df_twitter_archive.rating_denominator.value_counts()

In [None]:
print(df_twitter_archive.loc[df_twitter_archive.rating_denominator == 11, 'text']) 
print(df_twitter_archive.loc[df_twitter_archive.rating_denominator == 2, 'text']) 
print(df_twitter_archive.loc[df_twitter_archive.rating_denominator == 16, 'text']) 
print(df_twitter_archive.loc[df_twitter_archive.rating_denominator == 15, 'text'])
print(df_twitter_archive.loc[df_twitter_archive.rating_denominator == 7, 'text'])

In [None]:
#retweet - it will be deleted when delete all retweets
print(df_twitter_archive['text'][784]) 
#actual rating 14/10 need to change manually
print(df_twitter_archive['text'][1068]) 
#actual rating 10/10 need to change manually
print(df_twitter_archive['text'][1662]) 
#actual rating 9/10 need to change manually
print(df_twitter_archive['text'][2335]) 
#tweet to explain rating
print(df_twitter_archive['text'][1663]) 
#no rating - delete
print(df_twitter_archive['text'][342]) 
#no rating - delete
print(df_twitter_archive['text'][516]) 

In [None]:
df_twitter_archive['name'].value_counts()

In [None]:
df_twitter_archive[df_twitter_archive.tweet_id.duplicated()]

In [None]:
df_twitter_archive.describe()

**Image Dataframe Analysis**

In [None]:
df_image_predictions.sample(5)

In [None]:
# This is an image for tweet_id 856282028240666624
Image(url = 'https://pbs.twimg.com/media/C-If9ZwXoAAfDX2.jpg')

In [None]:
df_image_predictions.info()

In [None]:
df_image_predictions[df_image_predictions.tweet_id.duplicated()]

In [None]:
df_image_predictions['p1'].value_counts()

In [None]:
df_image_predictions['p2'].value_counts()

In [None]:
df_image_predictions['p3'].value_counts()

**Twitter Counts Dataframe Analysis**

In [None]:
tweet_json.head()

In [None]:
tweet_json.info()

In [None]:
tweet_json.describe()

## Clean

This section consists of the cleaning portion of the data wrangling process:

* Define
* Code
* Test


In [None]:
# Make a copy of the tables before cleaning
df_twitter_archive_clean = df_twitter_archive.copy()
df_image_predictions_clean = df_image_predictions.copy()
tweet_json_clean = tweet_json.copy()

#### Define

1. Merge the clean versions of the `df_twitter_archive`, `df_image_predictions`, and `tweet_json` dataframes
2. Correct the dog types: create one column for the various dog types (doggo, floofer, pupper, puppo)
3. Delete retweets
4. Remove columns no longer needed: in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id, retweeted_status_user_id, and retweeted_status_timestamp
5. Change tweet_id from an integer to a string
6. Change the timestamp to correct datetime format
7. Correct naming issues
8. Standardize dog ratings
9. Create a new dog_breed column using the image prediction data

* Merge the clean versions of the `df_twitter_archive`, `df_image_predictions`, and `tweet_json` dataframes

**Code**

In [None]:
# Ref: https://stackoverflow.com/questions/44327999/python-pandas-merge-multiple-dataframes/44338256
# Note: with axis=1, concat lines rows up by index label, not by tweet_id
dfs = pd.concat([df_twitter_archive_clean, df_image_predictions_clean, tweet_json_clean], join='outer', axis=1)

In [None]:
dfs.head()

In [None]:
dfs.columns

**Test**

In [None]:
dfs.info()
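
Because `pd.concat` with `axis=1` aligns rows by index rather than by tweet, a key-based merge is one alternative worth considering. The sketch below is a minimal illustration on made-up rows, not a drop-in replacement for the cell above: it assumes `tweet_id` identifies the same tweet in all three tables and that the counts table stores it as a string, as `tweet_json` does.

```python
import pandas as pd

# Toy stand-ins for the three tables (values are made up)
archive = pd.DataFrame({"tweet_id": [10, 20, 30], "rating_numerator": [13, 12, 11]})
predictions = pd.DataFrame({"tweet_id": [30, 10], "p1": ["pug", "golden_retriever"]})
counts = pd.DataFrame({"tweet_id": ["20", "10", "30"], "retweet_count": [5, 7, 9]})

# The key must have the same dtype in every table before merging
counts["tweet_id"] = counts["tweet_id"].astype("int64")

# Left merges keep every archive row and attach matching rows by tweet_id
merged = (
    archive
    .merge(predictions, on="tweet_id", how="left")
    .merge(counts, on="tweet_id", how="left")
)
print(merged)
```

Rows are matched on `tweet_id` regardless of their order in each table, and tweets without an image prediction simply get missing values in the prediction columns.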

* **Code and Test**: Create one column for the various dog types: doggo, floofer, pupper, puppo

In [None]:
# Extract the first dog type mentioned in the text into the new dog_type column
dfs['dog_type'] = dfs['text'].str.extract(r'(doggo|floofer|pupper|puppo)', expand=False)

In [None]:
dfs[['dog_type', 'doggo', 'floofer', 'pupper', 'puppo']].sample(5)

In [None]:
dfs.head()

In [None]:
dfs.columns

In [None]:
dfs.dog_type.value_counts()
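
Extracting a single match keeps only the first dog type mentioned in a tweet. If tweets that mention more than one type matter for the analysis, a sketch along the following lines could collect all of them. The helper name `find_stages` and the sample sentences are hypothetical, and allowing an optional plural "s" is an assumption about how the terms appear.

```python
import re

# Match any of the four stage words, optionally pluralized, in any letter case
STAGE_PATTERN = re.compile(r"\b(doggo|floofer|pupper|puppo)s?\b", re.IGNORECASE)

def find_stages(text):
    """Return every distinct stage mentioned in the text, or None (hypothetical helper)."""
    found = {match.lower() for match in STAGE_PATTERN.findall(text)}
    return ",".join(sorted(found)) if found else None

print(find_stages("This doggo met a pupper"))
print(find_stages("Just a dog"))
```

Joining the sorted matches gives a stable label such as `doggo,pupper` for multi-dog tweets.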

* **Code and Test**: Delete retweets

In [None]:
dfs = dfs[np.isnan(dfs.retweeted_status_id)]

In [None]:
# Verify that no non-null retweet entries are left
dfs.info()

In [None]:
# Remove the retweet columns, which are now empty
dfs = dfs.drop(['retweeted_status_id',
                'retweeted_status_user_id',
                'retweeted_status_timestamp'], axis=1)

In [None]:
dfs.info()

*  **Code and Test**: Remove columns no longer needed

In [None]:
dfs.drop(['in_reply_to_status_id', 
          'in_reply_to_user_id',
          'source',
          'img_num',
          'friends_count',
          'url',
          'followers_count'], axis = 1, inplace=True)

In [None]:
# Ref: https://stackoverflow.com/questions/14984119/python-pandas-remove-duplicate-columns
dfs = dfs.loc[:,~dfs.columns.duplicated()]

In [None]:
dfs.columns

In [None]:
dfs.info()

In [None]:
dfs.drop(['retweeted_status'], axis = 1, inplace=True)

In [None]:
dfs.info()

* **Code and Test**: Change tweet_id from an integer to a string

In [None]:
dfs['tweet_id'] = dfs['tweet_id'].astype(str)

In [None]:
dfs.info()

 * **Code and Test**: Timestamps to datetime format

In [None]:
#Remove the time zone from the 'timestamp' column
dfs['timestamp'] = dfs['timestamp'].str.slice(start=0, stop=-6)

In [None]:
# Change the 'timestamp' column to a datetime object
dfs['timestamp'] = pd.to_datetime(dfs['timestamp'], format = "%Y-%m-%d %H:%M:%S")

In [None]:
dfs.head(1)
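
Slicing off the last six characters works when every timestamp ends in the same offset. An alternative is to parse the offset explicitly and then drop it. The sketch below uses only the standard library and assumes the raw timestamps look like `2017-08-01 16:23:56 +0000`; that sample value is an assumption, not taken from the archive.

```python
from datetime import datetime, timedelta

# Assumed shape of a raw archive timestamp (illustration only)
raw_timestamp = "2017-08-01 16:23:56 +0000"

# %z parses the UTC offset, producing a timezone-aware datetime
aware = datetime.strptime(raw_timestamp, "%Y-%m-%d %H:%M:%S %z")

# Drop the offset once it has been checked, giving a naive datetime
naive = aware.replace(tzinfo=None)
print(aware.utcoffset(), naive)
```

Parsing the offset makes it possible to notice a timestamp that is not in UTC, which a fixed slice would silently hide.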

* **Code and Test**: Correct naming issues

In [None]:
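# Names that start with lowercase letters are not real names; mark them as 'None'
dfs['name'] = dfs['name'].str.replace(r'^[a-z]+', 'None', regex=True)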

In [None]:
dfs['name'].value_counts()

In [None]:
dfs['name'].sample(10)
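
The naming rule used above (entries that begin with lowercase letters are treated as non-names) can be checked on a few values before applying it to the whole column. This is a minimal sketch with a hypothetical helper `clean_name` and made-up inputs. It is stricter than the cell above: it only rejects entries made up entirely of lowercase letters.

```python
import re

def clean_name(name):
    """Return the name, or None when it looks like a mis-extracted word (hypothetical helper)."""
    # Assumption: real dog names are capitalized, so all-lowercase words are not names
    if name is None or name == "None" or re.fullmatch(r"[a-z]+", name):
        return None
    return name

for candidate in ["Stuart", "a", "quite", "None"]:
    print(candidate, "->", clean_name(candidate))
```

Returning `None` rather than the string `'None'` lets pandas treat the value as missing, so it is excluded from counts automatically.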

* **Code and Test**: Standardize dog ratings

In [None]:
dfs['rating_numerator'] = dfs['rating_numerator'].astype(float)

In [None]:
dfs['rating_denominator'] = dfs['rating_denominator'].astype(float)

In [None]:
dfs.info()

In [None]:
# Gather the text, index, and rating of every tweet whose rating numerator contains a decimal
ratings_decimals_text = []
ratings_decimals_index = []
ratings_decimals = []

# Series.items() replaces iteritems(), which was removed in pandas 2.0
for i, text in dfs['text'].items():
    if re.search(r'\d+\.\d+/\d+', text):
        ratings_decimals_text.append(text)
        ratings_decimals_index.append(i)
        ratings_decimals.append(re.search(r'\d+\.\d+', text).group())

# Display the tweets whose ratings contain decimals
ratings_decimals_text

In [None]:
# Print the indices of the ratings above (have decimal)
ratings_decimals_index

In [None]:
# Overwrite the truncated numerators with the decimal ratings found above
for index, rating in zip(ratings_decimals_index, ratings_decimals):
    dfs.loc[index, 'rating_numerator'] = float(rating)

In [None]:
# Testing the indices 
dfs.loc[40]

In [None]:
Image(url = 'https://pbs.twimg.com/media/CUCQTpEWEAA7EDz.jpg')

In [None]:
# Create a new column called rating, and calculate its value as numerator / denominator
dfs['rating'] = dfs['rating_numerator'] / dfs['rating_denominator']

In [None]:
dfs.sample(10)

In [None]:
dfs.loc[30]

In [None]:
dfs.rating.head()
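
The decimal-rating fix can also be sketched as a single extraction that handles whole and decimal numerators alike. The helper name `extract_rating` and the sample sentences are hypothetical. The sketch takes the first `number/number` pattern in the text, so tweets containing dates or fractions would still need the manual checks noted in the assessment.

```python
import re

# A numerator with an optional decimal part, a slash, then a whole-number denominator
RATING_PATTERN = re.compile(r"(\d+(?:\.\d+)?)/(\d+)")

def extract_rating(text):
    """Return (numerator, denominator) from the first rating-like pattern, or None."""
    match = RATING_PATTERN.search(text)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))

print(extract_rating("This is Bella. 13.5/10 would pet"))
print(extract_rating("No rating in this one"))
```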

* **Code and Test**: Create a new dog_breed column using the image prediction data

In [None]:
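dfs['dog_breed'] = 'None'

# Use the first prediction that is a dog breed; the second and third
# predictions are only trusted for ratings of 10 or higher.
# DataFrame.at replaces set_value, which was removed in pandas 1.0.
for i, row in dfs.iterrows():

    if row.p1_dog:
        dfs.at[i, 'dog_breed'] = row.p1
    elif row.p2_dog and row.rating_numerator >= 10:
        dfs.at[i, 'dog_breed'] = row.p2
    elif row.p3_dog and row.rating_numerator >= 10:
        dfs.at[i, 'dog_breed'] = row.p3
    else:
        dfs.at[i, 'dog_breed'] = 'None'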

In [None]:
dfs.dog_breed.value_counts()
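
The row-by-row loop can be slow on larger tables. A column-wise version of the same idea is sketched below on made-up predictions. It is a simplification rather than an exact equivalent: it ignores the `rating_numerator >= 10` condition and simply takes the first prediction flagged as a dog.

```python
import pandas as pd

# Toy prediction rows (values are made up)
preds = pd.DataFrame({
    "p1": ["golden_retriever", "paper_towel", "seat_belt"],
    "p1_dog": [True, False, False],
    "p2": ["Labrador_retriever", "pug", "shopping_cart"],
    "p2_dog": [True, True, False],
    "p3": ["kuvasz", "chow", "toaster"],
    "p3_dog": [True, True, False],
})

# Series.where keeps a value where the condition holds and falls back otherwise,
# so nesting the calls walks through p1, then p2, then p3, then 'None'
fallback_p3 = preds["p3"].where(preds["p3_dog"], "None")
fallback_p2 = preds["p2"].where(preds["p2_dog"], fallback_p3)
preds["dog_breed"] = preds["p1"].where(preds["p1_dog"], fallback_p2)

print(preds["dog_breed"].tolist())
```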

### Storing, Analyzing, and Visualizing Data

This section provides an analysis of the data set, and corresponding visualizations to draw valuable conclusions.

   1. Visualizing the total number of tweets over time to see whether that number increases, or decreases, over time.
   2. Visualizing the retweet counts, and favorite counts comparison over time.
   3. Visualizing the most popular dog breed
   4. Visualizing the most popular dog names


In [None]:
# Storing the new twitter_dogs df to a new csv file
dfs.to_csv('twitter_archive_master.csv', encoding='utf-8', index=False)

* **Analyze and Visualize**: Visualizing the total number of tweets over time to see whether that number increases, or decreases, over time.

In [None]:
dfs.timestamp = pd.to_datetime(dfs['timestamp'], format='%Y-%m-%d %H:%M:%S.%f')

monthly_tweets = dfs.groupby(pd.Grouper(key = 'timestamp', freq = "M")).count().reset_index()
monthly_tweets = monthly_tweets[['timestamp', 'tweet_id']]
monthly_tweets.head()
# Total number of tweets across all months
monthly_tweets['tweet_id'].sum()

In [None]:
# Plotting time vs. tweets

plt.figure(figsize=(10, 10));
plt.xlim([datetime.date(2015, 11, 30), datetime.date(2017, 7, 30)]);

plt.xlabel('Year and Month')
plt.ylabel('Tweets Count')

plt.plot(monthly_tweets.timestamp, monthly_tweets.tweet_id);
plt.title('We Rate Dogs Tweets over Time');

The number of tweets decreased sharply over time, with spikes in activity in early 2016 (January and March) and a general decline from there.

* **Analyze and Visualize**: Visualizing the retweet counts, and favorite counts comparison over time.

In [None]:
# Scatterplot of retweets vs favorite count

# Note: the 'size' argument was renamed to 'height' in seaborn 0.9
sns.lmplot(x="retweet_count", 
           y="favorite_count", 
           data=dfs,
           height = 5,
           aspect=1.3,
           scatter_kws={'alpha':1/5});

plt.title('Favorite Count vs. Retweet Count');
plt.xlabel('Retweet Count');
plt.ylabel('Favorite Count');

Favorite counts are correlated with retweet counts - this is a positive correlation.
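
The strength of this relationship can be quantified with a correlation coefficient. The sketch below shows the calculation on made-up counts, so the resulting value says nothing about the real dataset. It uses the Pearson correlation, which is the default for `Series.corr`.

```python
import pandas as pd

# Made-up counts that rise together (illustration only)
toy_counts = pd.DataFrame({
    "retweet_count": [1, 2, 3, 4],
    "favorite_count": [3, 6, 9, 12],
})

# Pearson correlation: +1 is a perfect positive linear relationship
correlation = toy_counts["retweet_count"].corr(toy_counts["favorite_count"])
print(round(correlation, 3))
```

Applied to the cleaned data, the same call on `dfs['retweet_count']` and `dfs['favorite_count']` would put a number on the trend shown in the scatterplot.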

* **Analyze and Visualize**: Visualizing the most popular dog breed

In [None]:
dfs['dog_breed'].value_counts()

The most popular dog breed is a golden retriever, with a labrador retriever coming in as the second most popular breed.

In [None]:
# Horizontal bar chart of the dog breeds that appear at least 25 times
dog_breed = dfs.groupby('dog_breed').filter(lambda x: len(x) >= 25)

dog_breed['dog_breed'].value_counts().plot(kind = 'barh')
plt.title('Most Rated Dog Breed')
plt.xlabel('Count')
plt.ylabel('Breed of dog');

* **Analyze and Visualize**: Visualizing the most popular dog names

In [None]:
dfs.name.value_counts()[0:7].plot(kind='barh', figsize=(15,8), title='Most Common Dog Names').set_xlabel("Number of Dogs");

In [None]:
dfs.name.value_counts()

The three most popular dog names are:
1. Lucy - 11
2. Charlie - 11
3. Oliver - 10 and so on