# Project: Wrangling and Analyzing Data

## Introduction

In this project, I will be using data from @WeRateDogs for my analysis.
I will gather the data from three different sources, assess it and clean it until it is fit for my analysis.
I will then explore the cleaned data to draw insights on the dog breeds that are more popular among dog lovers.


### Note:
Import all the libraries needed for this project.

In [1]:
# Import the necessary libraries
import pandas as pd
import numpy as np
import requests
import json
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
# import tweepy

## Data Gathering

### Note:
Load the twitter archive dataset to the dataframe.

In [2]:
# Load the twitter archive dataset to the dataframe.
archive = pd.read_csv("twitter-archive-enhanced.csv")

FileNotFoundError: [Errno 2] No such file or directory: 'twitter-archive-enhanced.csv'

In [None]:
# Take a look at the first few rows to confirm 
archive.head()

### Note:
Download the image predictions dataset with the requests library and load it into a dataframe.

In [None]:
# url 
url = "https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv"



In [None]:
# Requests to get url
response = requests.get(url)

In [None]:
# write content to file
with open("image_file", "wb") as my_file:
    my_file.write(response.content)

In [None]:
# load the created file into dataframe.
image = pd.read_csv("image_file", delimiter="\t")

In [None]:
image.head(3)

#### Note:
Gather additional twitter data.

In [None]:
tweet = pd.read_json("tweet.json", lines=True)

In [None]:
tweet.info()

### Note:
The id, retweet_count and favorite_count columns are the only columns useful for the analysis.
Hence, only these columns will be kept.



In [None]:
# Keep only these columns in the dataframe.
tweet = tweet[["id", "retweet_count", "favorite_count"]]

In [None]:
# A quick look.
tweet.info()
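
#### Note:
If `pd.read_json` is not an option, the same three fields can also be picked out of a line-delimited JSON file with the standard `json` module. The following is only a minimal sketch: the two sample records and their values are made up for illustration, and only the field names `id`, `retweet_count` and `favorite_count` come from the dataset above.

```python
import io
import json

# Two made-up records in the same line-delimited layout as tweet.json.
sample = io.StringIO(
    '{"id": 1, "retweet_count": 5, "favorite_count": 20, "lang": "en"}\n'
    '{"id": 2, "retweet_count": 7, "favorite_count": 31, "lang": "en"}\n'
)

wanted = ["id", "retweet_count", "favorite_count"]

# Parse one JSON object per line and keep only the wanted fields.
rows = []
for line in sample:
    record = json.loads(line)
    rows.append({key: record[key] for key in wanted})

print(rows)
```

With a real file, `sample` would be replaced by an opened file handle, and `rows` could then be passed to `pd.DataFrame`.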

## Assessing Data
In this section, I will assess all the datasets gathered in the Data Gathering section. This will be done both visually and programmatically.

Assessing the data includes identifying and writing down both the quality and tidiness issues found in the datasets.

These issues will later be cleaned in the "Data Cleaning" section.

### Note:
Visually assess the archive dataset, identifying both quality and tidiness issues.


### Note:
I will set the maximum number of rows displayed by pandas to unlimited so that I can visually assess the data down the rows.

I will also use Microsoft Excel to visually assess the data.

In [None]:
# Display archive data

archive

###  Note:
In the next set of cells, I will assess the twitter archive data programmatically.

In [None]:
# Programmatically assess the archive data for data types,
# number of rows and columns, and missing data.

archive.info()

In [None]:
# Assess the numeric columns for incorrect data such as outliers.

archive.describe()

In [None]:
# Assess the unique values of rating_denominator for values other than 10.

archive["rating_denominator"].value_counts()

In [None]:
# Count the rows where rating_numerator is less than or equal to rating_denominator.

(archive["rating_numerator"] <= archive["rating_denominator"]).value_counts()

In [None]:
# Count the rows where rating_numerator is less than 10.

(archive["rating_numerator"] < 10).value_counts()

In [None]:
# Count the rows where rating_denominator is not equal to 10.

(archive["rating_denominator"] != 10).value_counts()

In [None]:
# Show the rows where rating_denominator is greater than 10.
archive[archive["rating_denominator"] > 10]

In [None]:
# Assess the dog names for missing values, wrong spellings and incorrect data.

archive["name"]

In [None]:
# Check the number of floofers in the floofer column.

archive["floofer"].value_counts()

In [None]:
# Assess the source column for unique values and errors in data.

archive["source"].value_counts()

In [None]:
# Visually assess in_reply_to_user_id for missing and incorrect data.
archive["in_reply_to_user_id"]

### Note: 
Visual assessment of image information data in the following cell

In [None]:
image

### Note: 
Programmatic assessment of image information data in the following cell

In [None]:
# Programmatically assess the image data for data types,
# number of rows and columns, and missing data.

image.info()

In [None]:
# Assess the numeric columns for incorrect data such as outliers.

image.describe()

In [None]:
# Assess the p1 column for unique values and errors in data.

image["p1"].value_counts()

### Note:

Visually assess the tweet dataset, identifying both quality and tidiness issues.

In [None]:
tweet

### Note:
In the next set of cells, I will assess the tweet data programmatically.


In [None]:
# Programmatically assess the tweet data for data types,
# number of rows and columns, and missing data.

tweet.info()

In [None]:
# Assess the numeric columns for incorrect data such as outliers,
# and get a statistical summary of the data.
tweet.describe()

In [None]:
archive[archive["rating_numerator"]  == archive["rating_numerator"].max()]

### Note:

Below, I highlight the issues found in the datasets. They comprise both quality issues (messy data) and tidiness issues (untidy data).
These issues were discovered through visual and programmatic assessment.
They will be cleaned in the Data Cleaning section.

### Quality issues.

#### twitter archive:

- Some of the rating denominators have values other than 10.
    - I will take 10 as the standard denominator against which dog ratings are measured.
    - I am only considering one dog at a time, not multiple dogs at once.
- The timestamp column is a string instead of datetime.
- The tweet id is an integer instead of a string.
- The doggo, floofer, pupper and puppo data are strings instead of the category data type.
- Some values start with a lowercase letter in the p1, p2 and p3 columns (image prediction dataset).
- Underscores are used instead of spaces in p1, p2 and p3 (image prediction dataset).
- HTML tags in the source column.
- Some records in the dataset are not about dogs.
- There are dog tweets with no image in the dataset.
- There are outliers in the rating_numerator column, with values as high as 1776.
- Missing data in the following columns:
    - in_reply_to_status_id
    - in_reply_to_user_id
    - retweeted_status_id
    - retweeted_status_user_id
    - retweeted_status_timestamp
    - expanded_urls


#### Image prediction:


- Tweet id is in integer format instead of string.


#### Tweet:

- The id is an integer instead of a string.
- retweet_count is not descriptive enough as a column name. total_retweets will be better.
- favorite_count is not descriptive enough as a column name. total_likes will be better.



### Tidiness issues

- The dog stage values (doggo, floofer, pupper and puppo) are in separate columns instead of one.
- In the image prediction dataset, there are too many columns about the strength of the predictions.
    - Cleaning should be performed on these prediction columns (p1, p1_conf, p1_dog, p2, etc.) to reduce them to fewer columns.
- Tweet id appears in all the datasets instead of one.
    - This is understandable because it is needed for merging the datasets. Only one tweet_id column will be retained after merging. The others will be dropped.

The following columns were dropped from the image dataset after the result of the prediction data(p1, p1_conf, p1_dog, p2, etc.) were used to separate dog records from non-dog records.

- jpg_url
- img_num
- p1
- p1_conf
- p1_dog
- p2
- p2_conf
- p2_dog
- p3
- p3_conf
- p3_dog


Other columns dropped include:

- prediction
- name
- dog_stage
- source
- date

## Data Cleaning

### Take a copy of all the three datasets for cleaning purpose

In [None]:
archive_clean = archive.copy()
image_clean = image.copy()
tweet_clean = tweet.copy()

### Missing Data

The instruction states that retweeted tweets should be excluded from the analysis.
Such data will be dropped.

### Define
Drop every row that contains a retweeted status id.

### Code:

In [None]:
# Check the data information for the column names before dropping.
archive_clean.info()

In [None]:
# Drop every row that contains a retweeted status id.

archive_clean = archive_clean[archive_clean["retweeted_status_id"].isnull()]

### Test

In [None]:
# Confirm result.
archive_clean.info()

In [None]:
# Visually confirm the result.
archive_clean.head(500)


### Fixing missing data:

The instructions say to consider only tweets with images.
The expanded_urls column contains the links of tweets with images.

### Define
I will keep only the tweets that have images for the analysis. This implies that tweets with no images are dropped.

### Code

In [None]:
# Keep only the rows whose image link is not missing.
archive_clean = archive_clean[archive_clean["expanded_urls"].notnull()]

### Test

In [None]:
# Check the new dataset for null values.
archive_clean["expanded_urls"].isnull().sum() 

# A result of zero confirms we now have only tweets with images.

In [None]:
# Check data information to confirm.
archive_clean.info()

### Tidiness

The dog stage values (doggo, floofer, pupper and puppo) are in separate columns instead of one.


### Define
Combine the values of the dog stages to form one column

### Code

In [None]:
archive_clean[["doggo", "floofer", "pupper", "puppo"]].value_counts()

In [None]:
# Helper function that drops rows or columns from a dataframe in place.
def drop_col(dframe, row_col, axis=0):
    dframe.drop(row_col, axis=axis, inplace=True)

In [None]:
# A list of the dog stage columns.
dog_stages = ['doggo', 'floofer', 'pupper', 'puppo']

# Replace the string 'None' with NaN so that missing stages can be dropped.
archive_clean[dog_stages] = archive_clean[dog_stages].replace('None', np.nan)

def join_all(x):
    # Join the non-missing stages of a row,
    # e.g. [NaN, 'floofer', NaN, 'puppo'] -> 'floofer, puppo'
    return ', '.join(x.dropna().astype(str))

archive_clean['dog_stages'] = archive_clean[dog_stages].apply(join_all, axis=1)

# Replace empty strings with NaN.
archive_clean['dog_stages'] = archive_clean['dog_stages'].replace('', np.nan)



In [None]:
# Drop these columns : doggo, floofer, pupper, puppo
drop_col(archive_clean, dog_stages, axis=1)
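
#### Note:
To see what combining the stage columns produces, here is a small self-contained sketch on made-up rows. The frame `toy` and the helper `join_stages` are hypothetical names used only for illustration. It shows that a tweet tagged with two stages ends up with a combined value rather than a single stage.

```python
import numpy as np
import pandas as pd

# Made-up rows: one dog with no stage, one with one stage, one with two.
toy = pd.DataFrame({
    "doggo":   ["None", "None", "doggo"],
    "floofer": ["None", "None", "None"],
    "pupper":  ["None", "pupper", "pupper"],
    "puppo":   ["None", "None", "None"],
})
stage_cols = ["doggo", "floofer", "pupper", "puppo"]

def join_stages(row):
    # Keep only the real stage names and join them with a comma.
    return ", ".join(value for value in row if value != "None")

# Rows with no stage produce an empty string, which is turned into NaN.
combined = toy[stage_cols].apply(join_stages, axis=1).replace("", np.nan)
print(combined.tolist())
```

Rows with more than one stage may deserve a separate look during assessment, since they can describe tweets with several dogs.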

### Test

In [None]:
# Check data information to confirm.
archive_clean.info()

In [None]:
# Visually check the dog_stages column for changes
archive_clean

#### Take 10 as the standard rating value for all dog ratings in the analysis
#### This will also remove the incorrect data and outliers in the rating_denominator column.


### Define

Remove the data with rating denominators other than 10 from the twitter archive data for the sake of consistency.

### Code

In [None]:
# Remove the data with rating denominators other than 10 from the twitter archive data for the sake of consistency.

archive_clean  = archive_clean[archive_clean["rating_denominator"] == 10]

### Test

In [None]:
# Check data information to confirm change.

archive_clean.info()

In [None]:
# Confirm that the only data in the rating_denominator column is 10.

archive_clean["rating_denominator"].value_counts()

### Fix timestamp column

### Define
Convert string data type to datetime in archive_clean dataset.

### Code



In [None]:
# Convert string data type to datetime in archive_clean dataset.

archive_clean["timestamp"] = pd.to_datetime(archive_clean["timestamp"])

### Test

In [None]:
archive_clean.info()

#### Fix data type issue with tweet_id

#### Define
Convert integer data  to string data in tweet_id column.

#### Code

In [None]:
# Convert integer data  to string data.
archive_clean["tweet_id"] = archive_clean["tweet_id"].astype("str")

### Test

In [None]:
# Check information to confirm change.

archive_clean.info()

### Get rid of html tags in source columns.

### Define
Extract the values from the html tags in source column.



#### Code

In [None]:
# This code extracts the text element from the html tags.
archive_clean["source"]  = archive_clean["source"].str.split(">", expand=True)[1].str.split("<", expand=True)[0]
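
#### Note:
The same text could also be pulled out with a regular expression instead of two splits. This is only a sketch of an alternative, not the method used in this notebook, and the sample value below is illustrative.

```python
import pandas as pd

# Illustrative value in the same shape as the source column.
source = pd.Series([
    '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
])

# Capture the text between the first ">" and the following "<".
text = source.str.extract(r">([^<]+)<", expand=False)
print(text.tolist())
```

A value that does not match the pattern becomes NaN, which makes unexpected formats easy to spot afterwards.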

### Test

In [None]:
# Print first five rows to confirm change.

archive_clean["source"].head()

### Fixing data type issues with tweet_id and id in image and tweet dataset


#### Define
Convert integer data to string data for tweet_id and id in image and tweet dataset

#### Code


In [None]:
# Fix data type issue with tweet_id and id column.
image_clean["tweet_id"] = image_clean["tweet_id"].astype("str")
tweet_clean["id"] = tweet_clean["id"].astype("str")

#### Test

In [None]:
# Check data information to confirm change.

image_clean.info()

In [None]:
# Check data information to confirm change.

tweet_clean.info()

In [None]:
image.head()

### Underscore in p1, p2 and p3 instead of space

#### Define
Replace "_" with space in p1, p2 and p3

#### Code

In [None]:
# Replace "_" with space in p1, p2 and p3 column.

image_clean["p1"] = image_clean["p1"].str.replace("_", " ")
image_clean["p2"] = image_clean["p2"].str.replace("_", " ")
image_clean["p3"] = image_clean["p3"].str.replace("_", " ")

#### Test

In [None]:
# Print first five rows of p1, p2 and p3 column to confirm change.

image_clean[["p1","p2", "p3"]].head()

### Some values start with lowercase in p1, p2 and p3 data of the image dataset

#### Define
Convert the first letter of each word to uppercase in p1, p2 and p3.

#### Code

In [None]:
# Convert the first letter of each word to uppercase in p1, p2 and p3.

image_clean["p1"]  = image_clean["p1"].str.title()
image_clean["p2"]  = image_clean["p2"].str.title()
image_clean["p3"]  = image_clean["p3"].str.title()

#### Test

In [None]:
# Print first five rows to confirm result.

image_clean[["p1","p2", "p3"]].head()

###  Fix tidiness Issues


 
In the image prediction dataset, there are too many columns about the strength of the predictions and their results.


Cleaning should be performed on these prediction columns (p1, p1_conf, p1_dog, p2, etc.) to reduce them to fewer columns.


Create two new columns that capture the same information as these columns (p1, p1_conf, p1_dog, p2, p2_conf, p2_dog, p3, p3_conf, p3_dog).

#### Define

Extract out the data whose prediction is true for dogs.



#### Code 

In [None]:
# Condition for the data whose prediction is true for dogs.

condition = [(image_clean["p1_dog"] == True), (image_clean["p2_dog"] == True), (image_clean["p3_dog"] == True)]

In [None]:
# List of the dog breeds.

dog_breed = [image_clean["p1"], image_clean["p2"], image_clean["p3"] ]

In [None]:
dog_breed

In [None]:
# List of the prediction (probability) values.

prediction = [image_clean["p1_conf"], image_clean["p2_conf"], image_clean["p3_conf"]]

In [None]:
# Take the dog breed and prediction values of the first prediction that is a dog
# and store them in two new columns: dog_breed and prediction.

image_clean["dog_breed"] = np.select(condition, dog_breed, default=None)
image_clean["prediction"] = np.select(condition, prediction, default=0)
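
#### Note:
`np.select` checks the conditions in order and, for each row, takes the value from the first condition that is true. The sketch below illustrates this on made-up arrays; the names and values are hypothetical and are not taken from the image dataset.

```python
import numpy as np

# Made-up predictions for three images.
p1 = np.array(["Pug", "Banana", "Toaster"], dtype=object)
p2 = np.array(["Beagle", "Collie", "Sock"], dtype=object)
p1_dog = np.array([True, False, False])
p2_dog = np.array([True, True, False])

# Row 0: p1 is a dog, so p1 wins even though p2 is a dog too.
# Row 1: only p2 is a dog, so p2 is used.
# Row 2: no prediction is a dog, so the default (None) is used.
breed = np.select([p1_dog, p2_dog], [p1, p2], default=None)
print(breed.tolist())
```

This is why the order of the condition list matters: putting p1 first gives priority to the most confident prediction.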

### Test

In [None]:
# Check data information to confirm the newly created columns.

image_clean.info()

In [None]:
# Print first 10 rows to confirm change.

image_clean.head(10)

#### Define

Since only data about dog tweets is needed for the analysis, remove the other data.

#### Note:
Rows with a null value in the dog_breed column are non-dog tweets.

#### Code

In [None]:
# Extract out data where dog_breed values are not null.

image_clean = image_clean[image_clean["dog_breed"].notnull()]

####  Test

In [None]:
# Check data information to confirm non-dog tweets have been removed.

image_clean.info()

### Remove outliers in the rating_numerator column of twitter archive data.

- There are outliers in the rating_numerator column.

Most rating_numerator values in the dataset are between 0 and 14.
However, there are a few other values: 27, 75, 420, 26 and 1776. Each of them appears only once in the dataset. I will consider them outliers and abnormal records, and I will remove them from the dataset.



#### Define
Remove outliers from the rating_numerator column.

#### Code

In [None]:
# Check unique values.

archive_clean["rating_numerator"].value_counts()

In [None]:
# List of outliers data.

outliers = [27, 75, 1776, 26, 420]

In [None]:
# for loop to remove all outlier data from the dataframe.

for outlier in outliers:
    archive_clean = archive_clean[archive_clean["rating_numerator"] !=  outlier]
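
#### Note:
The loop above can also be written as a single filter with `isin`. This is a sketch of an equivalent approach on a made-up column; the frame `toy` is hypothetical.

```python
import pandas as pd

# Made-up numerators, two of which are in the outlier list.
toy = pd.DataFrame({"rating_numerator": [12, 1776, 10, 420, 13]})
outliers = [27, 75, 1776, 26, 420]

# Keep only the rows whose numerator is not in the outlier list.
filtered = toy[~toy["rating_numerator"].isin(outliers)]
print(filtered["rating_numerator"].tolist())
```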


#### Test

In [None]:
# Check unique values to confirm outliers have been removed.

archive_clean["rating_numerator"].value_counts()

In [None]:
# Check data information to confirm change.

archive_clean.info()

### Note:
Check the data information for all three datasets to see the columns that will not be useful for my analysis. These columns will be dropped.

In [None]:
image_clean.info()

In [None]:
tweet_clean.info()

In [None]:
archive_clean.info()

### Drop columns not useful for analysis in archive datasets.

#### Define
Drop the following columns:

- in_reply_to_status_id
- in_reply_to_user_id
- text
- retweeted_status_id
- retweeted_status_user_id
- retweeted_status_timestamp
- expanded_urls
- rating_denominator

#### Code

In [None]:
# Print column names to know which columns to drop.

archive_clean.columns

In [None]:
# Drop columns not useful for analysis in archive datasets. 

archive_clean.drop(["in_reply_to_status_id", "in_reply_to_user_id", "text",
                    "retweeted_status_id", "retweeted_status_user_id",
                    "retweeted_status_timestamp",
                    "expanded_urls", "rating_denominator"], axis=1, inplace=True)

#### Test

In [None]:
# Print first five rows to confirm change.
archive_clean.head()

In [None]:
# Check data information to confirm change.
archive_clean.info()

### Rename Columns in archive datasets.

#### Define
Rename Columns in archive datasets to relatable names.

#### Code

In [None]:
# Rename Columns in archive datasets to relatable names.

archive_clean.rename(columns={"rating_numerator" : "rating(over_10)", "timestamp":"date","dog_stages":"dog_stage"}, inplace=True)

#### Test

In [None]:
# Print first five rows to confirm change in column names.

archive_clean.head()

### Dropping other columns not useful for my analysis.

#### Define
Drop the following columns in the image dataset:

- jpg_url
- img_num
- p1
- p1_conf
- p1_dog
- p2
- p2_conf
- p2_dog
- p3
- p3_conf
- p3_dog

#### Code

In [None]:
# Check data information for column names.
image_clean.info()

In [None]:
# Drop the unnecessary columns in the image dataset (jpg_url through p3_dog).

image_clean.drop(columns=image_clean.columns[1:12], inplace=True)

#### Test

In [None]:
# Print the first row to confirm change.
image_clean.head(1)

In [None]:
# Check data information to confirm change.

image_clean.info()

### Rename retweet_count and favorite_count to better names.

#### Define
- Rename retweet_count to total_retweets
- Rename favorite_count to total_likes

#### Code

In [None]:
# Rename columns.

tweet_clean.rename(columns={"retweet_count": "total_retweets", "favorite_count": "total_likes"}, inplace=True)

#### Test

In [None]:
# Print one row to confirm new column names.
tweet_clean.head(1)

### Merging dataframe

#### Define
Merge all of the following datasets:
- image_clean
- archive_clean
- tweet_clean

#### Code

#### Note:
The image_clean data will be the base for joining the other two datasets.
This is because the records of the image data have already been filtered to keep only dog records.
The prediction data, which the dataset creator collected from a machine learning algorithm, was used for this.


A left join of image_clean with the other datasets would be the natural choice. However, I have decided to use an inner join instead to avoid null values, because a left join would produce null values for image_clean records that are not present in the other datasets.

In [None]:
# Merge image_clean and archive_clean with inner join.

image_archive = pd.merge(image_clean, archive_clean, on="tweet_id", how="inner")

In [None]:
# Merge the resulting dataframe of the preceding cell with the tweet_clean dataframe using an inner join.

dog = pd.merge(image_archive, tweet_clean, left_on="tweet_id", right_on="id", how="inner")
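
#### Note:
The difference between the two join types can be seen on a pair of made-up frames. The names `left` and `right` and all values are hypothetical. The `indicator=True` option of `pd.merge` adds a `_merge` column that shows which records have no match.

```python
import pandas as pd

# Made-up frames: tweet "2" is missing from the right-hand frame.
left = pd.DataFrame({"tweet_id": ["1", "2", "3"],
                     "dog_breed": ["Pug", "Beagle", "Collie"]})
right = pd.DataFrame({"tweet_id": ["1", "3"], "rating": [12, 9]})

# The inner join keeps only the ids present in both frames.
inner = pd.merge(left, right, on="tweet_id", how="inner")

# The left join keeps every left row and fills the gaps with NaN.
left_joined = pd.merge(left, right, on="tweet_id", how="left", indicator=True)

print(len(inner))
print(left_joined["_merge"].tolist())
```

Checking the `left_only` rows before settling on an inner join shows how many dog records are lost by it.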

#### Test

In [None]:
# Check data information to confirm change.

image_archive.info()

In [None]:
# Print first five rows.

image_archive.head()

In [None]:
# Check data information to confirm the new dataframe.

dog.info()

In [None]:
# Print first five rows of the new dataframe.

dog.head()

### Note:
After I finished merging and started data exploration, I found that some columns still present in the dataset will not be useful for my analysis. I will do some further cleaning and drop these columns.

### Drop id column from the new dataframe.

#### Define
Drop id column from the new dataframe.

#### Code

In [None]:
# Drop id column from the new dataframe.

dog.drop(columns="id",inplace=True)

#### Test

In [None]:
# Check data information to confirm change.
dog.info()

### Columns should be re-ordered according to data types for easy access.

#### Define
Reorder columns

#### Code

In [None]:
# Reorder columns for easy access.

dog = dog[["tweet_id", "rating(over_10)", "total_likes", "total_retweets", "prediction", "dog_breed", "name", "dog_stage", "source", "date" ]]

#### Test

In [None]:
# Print first two rows to confirm change.
dog.head(2)

### It appears some of the columns above will not be useful for my analysis. I will drop them. These columns include the following:

- prediction
- name
- dog_stage
- source
- date

#### Why am I dropping the above columns?

##### prediction:
The prediction data was used to determine the breeds of dogs. Its result also distinguishes dog tweets from non-dog tweets. Since I have already used this information to get the needed data from the datasets, I will drop it because it is no longer useful for my analysis.

##### name:
The name column contains some incorrect values such as "a", "an", "the", "None" and others. Since these values would not give accurate information about the names of the dogs, I will drop the column.

##### dog_stage:
There are too many missing values in the dog_stage column. Hence, I will remove it from the dataset.

##### source:
This is irrelevant for my analysis. Most of the tweets were sent from an iPhone. I am going to drop it.

##### date:
I will not be using the date for my analysis. So, I will drop it.



#### Define
Drop the following columns:
- prediction
- name
- dog_stage
- source
- date

#### Code

In [None]:
# Drop all the columns listed in the Define cell.

dog.drop(["prediction", "name", "dog_stage", "source", "date"], axis=1, inplace=True)

#### Test

In [None]:
# Check data information to confirm change.

dog.info()

In [None]:
# Print the first five rows to confirm change.

dog.head()

## Storing Data

In [None]:
# Store the final complete data.
dog.to_csv("twitter_archive_master.csv")

## Analyzing and Visualizing Data

### Note:
- Good dogs in this exploration are dogs that are rated 10 and above.

- Bad dogs are dogs that are given ratings below 10.

This idea was taken from the creator of the dog ratings (@WeRateDogs) in his reply to Brant. Brant had questioned why the poster rated dogs above 10. His reply was that such dogs are good dogs.

### Research Question 1
#### Did an increase in likes come with an increase in retweets among dog lovers?

#### Note:
I will answer this question by plotting the total number of likes and retweets on a scatter plot to see if there is a positive correlation between them.

In [None]:
# Print the first five rows to view the data.
dog.head()

In [None]:
# Scatter plot showing  the correlation of total likes and total retweets.

sns.set_style("darkgrid")
plt.figure(figsize=(9,8))
plt.scatter(x= dog["total_likes"], y=dog["total_retweets"])
plt.title("Total likes and retweets of dog lovers", fontsize= 14)
plt.xlabel("Total number of likes", fontsize= 13)
plt.ylabel("Total number of Retweets", fontsize= 13)
plt.show()

### Insight:

I found that there is a positive correlation between the total number of likes and the total number of retweets of dog tweets.

This means that tweets that receive more likes also tend to receive more retweets.
Dog lovers are more likely to retweet a dog tweet when they like it.
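
#### Note:
A scatter plot shows the direction of a relationship, and a correlation coefficient can put a number on it. The sketch below computes the Pearson correlation with `Series.corr` on made-up values; the frame `toy` is hypothetical and its numbers are not taken from the dog dataset.

```python
import pandas as pd

# Made-up likes and retweets for five tweets.
toy = pd.DataFrame({
    "total_likes":    [100, 400, 900, 2500, 8000],
    "total_retweets": [30, 110, 300, 700, 2600],
})

# Pearson correlation: values close to 1 indicate a strong positive
# linear relationship, values close to 0 indicate none.
r = toy["total_likes"].corr(toy["total_retweets"])
print(round(r, 3))
```

On the real data, the same call on the `total_likes` and `total_retweets` columns would give the value for this analysis.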

### Research Question 2

#### Which group of dogs did dog lovers retweet more?

- #### Good or bad dogs?


#### Note:
I will answer this question by plotting the dog ratings and the total number of retweets per dog on a scatter plot to see if there is a positive correlation between them.

In [None]:
# Scatter plot showing  the correlation of the ratings and total retweets.

plt.figure(figsize=(8,7))
plt.scatter(x= dog["rating(over_10)"], y=dog["total_retweets"])
plt.title("Dog ratings and their retweets by dog lovers", fontsize= 14)
plt.xlabel("Dog Ratings", fontsize= 13)
plt.ylabel("Total number of retweets", fontsize= 13)
plt.show()

### Insight:

From the scatter plot, I found that there is a positive correlation between the dog ratings and the total number of retweets.

This means that dog lovers tend to retweet the tweets about good dogs more than those about bad dogs.
This suggests that the ratings of the dog poster align with the opinion of dog lovers.

It can also be deduced from the above visualization that the higher the dog rating, the more retweets the tweet tends to receive.

### Research Question 3
#### Which group of dogs did dog lovers like more?

- #### Good dogs or bad dogs?

#### Note:
I will answer this question by plotting the rating of each dog and its total number of likes on a scatter plot to see if there is a positive correlation between them.

In [None]:
# Scatter plot showing the correlation of the dog ratings and total likes.

plt.figure(figsize=(8,7))
plt.scatter(x= dog["rating(over_10)"], y=dog["total_likes"])
plt.title("How dog lovers liked the dogs based on their ratings.", fontsize= 14)
plt.xlabel("Dog Ratings", fontsize= 13)
plt.ylabel("Total number of Likes", fontsize= 13)
plt.show()

### Insight:

From the scatter plot, I found that there is a positive correlation between the dog ratings and the total number of likes.
The higher the dog rating, the higher the number of likes tends to be.

The likes reached their highest levels for dogs with ratings of 10 and above (good dogs).
In fact, a few dogs rated 13 over 10 received over 120,000 likes.

This means that dog lovers tend to like the tweets about good dogs more than those about bad dogs.
This also suggests that the ratings of the dog poster align with the opinion of dog lovers.

### Research Question 4
#### What is the average rating of both groups of dogs?


### What is the average rating for good dogs?

### Note:
- I will first create the masks for good and bad dogs.
- Then, I will find the average ratings of both dog groups.

In [None]:
# dataframe for good and bad dogs.
good_dog = dog[dog["rating(over_10)"] >= 10]
bad_dog = dog[dog["rating(over_10)"] < 10]

In [None]:
# Print first five rows
good_dog.head()

In [None]:
# Print first five rows
bad_dog.head()

#### Average rating of good dogs.


In [None]:
# Find the average ratings of good dogs.
good_dog_rating = np.round(good_dog["rating(over_10)"].mean())

In [None]:
# show result
good_dog_rating

####  Average rating for bad dogs

In [None]:
# Average rating for bad dogs
bad_dog_rating  = np.round(bad_dog["rating(over_10)"].mean())

In [None]:
# show result

bad_dog_rating

#### Plot the average ratings of both groups on a bar chart.

In [None]:
# Function to add value labels to a bar chart.

def value_label(x,y):
    for i in x:
        plt.text(i, y[i]+0.2, y[i], ha="center")
        

# Please, note that this function will be called multiple times to add value label to bar charts.

In [None]:
# tick value for the x-axis of the bar chart

base = np.arange(2)

# labels of x-ticks of the bar chart
tick = ["Good dogs", "Bad dogs"]

In [None]:
# tick value for the y-axis of the bar chart

avg_rating = [good_dog_rating, bad_dog_rating]

In [None]:
# Bar chart showing the average ratings of both good and bad dogs.

plt.figure(figsize=(8,7))
plt.bar(base, avg_rating, width=0.5)
plt.title("The average rating of good and bad dogs", fontsize= 14)
plt.xlabel("Dogs", fontsize= 13)
plt.ylabel("Average rating", fontsize= 13)
plt.xticks(base, tick)

# Call function to add value labels to chart.
value_label(base, avg_rating)

plt.show()


### Insight:
Rounded to the nearest whole number, good dogs have an average rating of 11 while bad dogs have an average rating of 8.

Good dogs are really good because their average rating is above 10.

Bad dogs are bad because their average rating is below 10.

### Research Question 5
#### Which group of dogs has more likes among dog lovers?
- #### Good or bad dogs?

#### Average likes for good dog

In [None]:
# Average likes for good dogs.

good_dog_likes = np.round(good_dog["total_likes"].mean())

In [None]:
# show result
good_dog_likes

#### Average likes for bad dogs

In [None]:
# Average likes for bad dogs.

bad_dog_likes = np.round(bad_dog["total_likes"].mean())

In [None]:
bad_dog_likes

#### Note:
I will plot the result on a bar chart in the next research question section.

In [None]:
# tick values of y axis.
avg_likes = [good_dog_likes, bad_dog_likes]

## Research Question 6
### Which dog group did dog lovers retweet more?
- #### Good or bad dogs?


#### Note:
I will answer the question by finding the average number of retweets for good and bad dogs. I will then show the result on a bar chart.

#### Average retweet for good dogs.

In [None]:
# Average retweet for good dogs.

good_dog_retweets = np.round(good_dog["total_retweets"].mean())

In [None]:
# Show result
good_dog_retweets

#### Average retweet for bad dogs

In [None]:
# Average retweet for bad dogs.

bad_dog_retweets = np.round(bad_dog["total_retweets"].mean())

In [None]:
# Show result

bad_dog_retweets
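
#### Note:
The group averages computed above could also be obtained in one step with `groupby`. The following is a sketch on made-up values, not a replacement for the cells above; the frame `toy` and the labels array `group` are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up ratings, likes and retweets for four dogs.
toy = pd.DataFrame({
    "rating(over_10)": [12, 13, 8, 9],
    "total_likes":     [1000, 3000, 200, 400],
    "total_retweets":  [300, 700, 50, 150],
})

# Label each row as good (rating of 10 and above) or bad (below 10).
group = np.where(toy["rating(over_10)"] >= 10, "Good dogs", "Bad dogs")

# Average likes and retweets per group in a single call.
averages = toy.groupby(group)[["total_likes", "total_retweets"]].mean()
print(averages)
```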

#### Plot the average likes and retweets of good and bad dogs on a bar chart.

In [None]:

avg_retweets = [good_dog_retweets, bad_dog_retweets]

In [None]:
width=0.2

In [None]:
# Bar chart showing the average likes and retweets of good and bad dogs among dog lovers.

plt.figure(figsize=(7,6))
avg_retweets_bar = plt.bar(base-width/2, avg_retweets, width, label="Average Retweet")
avg_likes_bar = plt.bar(base+width/2, avg_likes, width, label="Average Likes")

# Add value label to the bar chart.
plt.bar_label(avg_retweets_bar, padding=2)
plt.bar_label(avg_likes_bar, padding=2)


plt.title("The average Likes and Retweet of both groups of dogs among dog lovers", fontsize= 14)
plt.xlabel("Dogs", fontsize= 13)
plt.ylabel("Average Like and  Retweet", fontsize= 13)
plt.xticks(base, tick)


plt.legend()

plt.show()



#### Insight 1:
From the bar chart, dog lovers retweeted good dogs more than bad dogs.
Good dogs have an average of 3,195 retweets while bad dogs have only 917. The difference is very clear. Again, dog lovers seem to agree with the opinion of the dog rater.

Remember, it is the dog poster who determines which dog is good and which is bad. He rates dogs he considers good 10 and above, and dogs he considers bad 9 and below.

#### Insight 2:

It can also be deduced that dog lovers liked good dogs more than bad dogs.
Good dogs have an average of 10,547 likes while bad dogs have 2,767. The difference is very clear. Here too, dog lovers seem to agree with the opinion of the dog rater.

## Research Question 7

### What breed is the dog with the highest number of likes?
- #### Are dogs of this breed generally good dogs?

In [None]:
# Record of dog breed with the highest number of likes.

dog[dog["total_likes"] == dog["total_likes"].max()]

#### Insight:
Dog lovers liked a Lakeland Terrier the most.
Note that this record only reflects the dog lovers who hit the like button on a single Lakeland Terrier tweet.
The dog posted with the tweet_id 822872901745569793 received 132,810 likes.

### Is the breed generally a good breed?

In [None]:
# Records of dogs that belong to the Lakeland Terrier breed.

lakeland_terrier = dog[dog["dog_breed"].str.contains("Lakeland Terrier")]

In [None]:
# Average rating of dogs that are of the Lakeland Terrier breed.
np.round(lakeland_terrier["rating(over_10)"].mean(), 2)

#### Insight:
From the analysis, I can deduce that dogs of the Lakeland Terrier breed are generally good dogs.
This is because their average rating is above 10.

Please note that this result is based on an average. There may also be bad dogs among them, but the average suggests that good dogs dominate in the breed.

## Research Question 8

### Which dog breed did dog lovers like and retweet the least?
- #### Are dogs of the breed generally bad dogs?


In [None]:
# Record of dog with the lowest number of likes.
dog[dog["total_likes"] == dog["total_likes"].min()]

In [None]:
# Record of dog with the lowest number of retweets.
dog[dog["total_retweets"] == dog["total_retweets"].min()]

#### Insight:
From the above result, I found that the dog with the tweet id 666102155909144576 was liked and retweeted the least by dog lovers. The dog belongs to the English Setter breed.

#### Are dogs of the breed (English Setter) generally good or bad dogs?

In [None]:
# Record of dogs that belong to the English Setter breed.

english_setter = dog[dog["dog_breed"].str.contains("English Setter")]

In [None]:
# The average rating of dogs that are of the English Setter breed.

np.round(english_setter["rating(over_10)"].mean(), 2)

#### Insight:
From the result above, I can deduce that dogs of the English Setter breed are generally bad dogs.
This is because their average rating is below 10.

Please note that this result is based on an average. There may also be good dogs among them, but the average suggests that bad dogs dominate in the breed.
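
#### Note:
An average alone does not show how many good and bad dogs a breed has. A per-breed summary with the share of good dogs would make that explicit. The sketch below uses made-up breeds and ratings; the frame `toy` and the column names `is_good`, `avg_rating`, `share_good` and `n_tweets` are hypothetical.

```python
import pandas as pd

# Made-up ratings for two breeds.
toy = pd.DataFrame({
    "dog_breed": ["Pug", "Pug", "Pug", "Beagle", "Beagle"],
    "rating(over_10)": [12, 11, 8, 9, 7],
})

# Flag good dogs (rating of 10 and above).
toy["is_good"] = toy["rating(over_10)"] >= 10

# Average rating, share of good dogs and number of tweets per breed.
summary = toy.groupby("dog_breed").agg(
    avg_rating=("rating(over_10)", "mean"),
    share_good=("is_good", "mean"),
    n_tweets=("is_good", "size"),
)
print(summary)
```

The number of tweets per breed matters as well, because an average over very few tweets says little about the breed as a whole.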

### Conclusion


- Dogs with high ratings tend to be liked and retweeted more by dog lovers.
- Dogs with low ratings are retweeted less by dog lovers.
- Dog lovers' sentiment seems to agree with the ratings of the dog poster.
- It appears that if a dog's rating is below 10, its tweet is more likely to be retweeted and liked less by dog lovers.

### Limitation:
In this project, I did not explore the dates on which the dog ratings were tweeted.
It is important to know that the number of followers WeRateDogs had at the time of tweeting could affect the number of likes and retweets.
Dogs posted in 2016 and 2017 are likely to have more likes than dogs posted in 2015, because the number of followers has multiplied over the years.
Dogs posted in 2015 are more likely to have received few reactions because the account was created in that same year. Hence, I could not draw conclusions on such dogs.
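
A follow-up analysis could check this by averaging likes per year. The sketch below uses made-up dates and likes, and assumes the `date` column is kept in the dataset instead of being dropped as it was in this project. The frame `toy` is hypothetical.

```python
import pandas as pd

# Made-up tweet dates and likes.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2015-11-20", "2016-06-01",
                            "2016-12-24", "2017-03-15"]),
    "total_likes": [500, 4000, 6000, 20000],
})

# Average number of likes per posting year.
likes_by_year = toy.groupby(toy["date"].dt.year)["total_likes"].mean()
print(likes_by_year)
```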


I must also mention that my analysis is limited and does not draw absolute conclusions on why some dogs receive more reactions than others. Further analysis, for example with more advanced techniques such as machine learning, is needed to draw conclusions on how dogs with low reactions differ from dogs with high reactions from dog lovers.
