# WeRateDogs Data Wrangling Project
The dataset is the tweet archive of the Twitter user @dog_rates, also known as WeRateDogs. WeRateDogs is a Twitter account that rates people's dogs and adds a humorous comment about each dog.

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#gather">Gathering</a></li>
<li><a href="#asses">Assessing</a></li>
<li><a href="#quality">Quality</a></li>
<li><a href="#tidiness">Tidiness</a></li>
<li><a href="#clean">Cleaning</a></li>
<li><a href="#analyse">Analysis and Visualization</a></li>
<li><a href="#insight">Insights</a></li>
<li><a href="#ref">Reference</a></li>
</ul>

<a id='intro'></a>
## Introduction

In [1]:
# Import the necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import requests
import tweepy
import json
import time

ModuleNotFoundError: No module named 'tweepy'

<a id='gather'></a>
## Gathering

In [None]:
# load archive data
twitter_archive_df = pd.read_csv('twitter-archive-enhanced.csv')
twitter_archive_df.head()

In [None]:
# download the image prediction data
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
r = requests.get(url)
# stop early if the download failed
r.raise_for_status()

# check the kind of data being fetched
print(r.headers.get('content-type'))

# save the file; the with-block closes it after writing
with open('image_predictions.tsv', 'wb') as file:
    file.write(r.content)

# read the .tsv into a dataframe
image_predictions_df = pd.read_csv('image_predictions.tsv', sep='\t')
image_predictions_df.head()

In [None]:
# test tweepy
import os

# Authenticate to Twitter. The keys are read from environment variables
# (the variable names are placeholders) so no secrets are stored in the notebook.
auth = tweepy.OAuthHandler(os.environ["CONSUMER_KEY"], os.environ["CONSUMER_SECRET"])
auth.set_access_token(os.environ["ACCESS_TOKEN"], os.environ["ACCESS_TOKEN_SECRET"])

api = tweepy.API(auth)

try:
    api.verify_credentials()
    print("Authentication OK")
except Exception:
    print("Error during authentication")

In [None]:
# Use tweepy to extract retweets and likes
import os

# Authenticate to Twitter. The keys are read from environment variables
# (the variable names are placeholders) instead of being hard-coded here.
CONSUMER_KEY = os.environ["CONSUMER_KEY"]
CONSUMER_SECRET = os.environ["CONSUMER_SECRET"]
ACCESS_TOKEN = os.environ["ACCESS_TOKEN"]
ACCESS_TOKEN_SECRET = os.environ["ACCESS_TOKEN_SECRET"]

auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)

# Create API object
api = tweepy.API(auth, wait_on_rate_limit=True)

In [None]:
# .values already returns a numpy array
tweet_ids = twitter_archive_df['tweet_id'].values
len(tweet_ids)

In [None]:
count = 0
fails_dict = {}
start = time.time()
# Save each tweet's returned JSON as a new line in a .txt file
with open('tweet_json.txt', 'w', encoding='utf8') as file:
    # This loop will likely take 20-30 minutes to run because of Twitter's rate limit
    for tweet_id in tweet_ids:
        count += 1
        print(str(count) + ": " + str(tweet_id))
        try:
            tweet = api.get_status(tweet_id, tweet_mode= 'extended')._json
            print("Success")
            json.dump(tweet, file)
            file.write('\n')
        # TweepError was renamed to tweepy.errors.TweepyException in Tweepy 4.x
        except tweepy.TweepError as e:
            print("Fail")
            fails_dict[tweet_id] = e
end = time.time()
print(fails_dict)
print(end - start)

In [None]:
tweet_json_list = []

start = time.time()

try:
    with open('tweet_json.txt', 'r', encoding='utf8') as file:
        for line in file:
            tweet = json.loads(line)
            
            #tweet id
            tweet_id = tweet['id_str']
            
            # number of likes in the tweet
            favorites = tweet['favorite_count']

            # number of retweets for the tweet
            retweets = tweet['retweet_count'] 

            #tweet's timestamp
            date_time = tweet['created_at']
            
            # append the fields
            tweet_json_list.append({'tweet_id': tweet_id,
                                    'favorites': favorites,
                                    'retweets': retweets,
                                    'date_time': date_time})
except FileNotFoundError:
    print("Oops! No such file")


end = time.time()
duration = end - start
print(duration)

In [None]:
tweet_json_df = pd.DataFrame(tweet_json_list, columns=['tweet_id', 'favorites', 'retweets','date_time'])
tweet_json_df.head()

<a id='asses'></a>
## Assessing

### Assess `twitter_archive_df`

In [None]:
twitter_archive_df

In [None]:
image_predictions_df

In [None]:
tweet_json_df

In [None]:
twitter_archive_df.info()

In [None]:
image_predictions_df.info()

In [None]:
tweet_json_df.info()

In [None]:
twitter_archive_df['rating_numerator'].value_counts()

In [None]:
twitter_archive_df['rating_denominator'].value_counts()

In [None]:
#check for duplicates
twitter_archive_df.duplicated().sum()

In [None]:
image_predictions_df.duplicated().sum()

In [None]:
tweet_json_df.duplicated().sum()

In [None]:
# view basic statistical details
twitter_archive_df.describe()

In [None]:
image_predictions_df.describe()

In [None]:
tweet_json_df.describe()

<a id='quality'></a>
#### Quality
**`twitter_archive_df` table**
- The dataframe contains retweets.
- Missing data in columns `in_reply_to_status_id`, `in_reply_to_user_id`, `retweeted_status_id`, `retweeted_status_user_id`, `retweeted_status_timestamp` and `expanded_urls`.
- Incorrect datatypes (`timestamp`, `in_reply_to_status_id`, `in_reply_to_user_id`, `retweeted_status_id`, `retweeted_status_user_id`, `retweeted_status_timestamp` and `rating_numerator` columns).
- Some entries in the `rating_denominator` column are greater than 10.
- Some entries in both `rating_numerator` and `rating_denominator` are very large.
- Some entries in the `rating_numerator` and `rating_denominator` columns differ from the rating in the tweet text.
- Some rows are missing ratings.
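
One way to surface the rating problems listed above is to re-extract the rating from the tweet text and compare it with the stored columns. The sketch below is only an illustration: the helper name `find_rating_mismatches` and the regular expression are assumptions, not part of the original notebook, and only the first `number/number` pattern in each tweet is read, so flagged rows still need a manual look.

```python
import pandas as pd

def find_rating_mismatches(df):
    """Return rows whose stored rating differs from the first rating found in the text.

    Hypothetical helper: expects the columns text, rating_numerator and
    rating_denominator, as in twitter_archive_df.
    """
    # Capture "numerator/denominator"; the numerator may contain a decimal point.
    extracted = df['text'].str.extract(r'(\d+(?:\.\d+)?)/(\d+)')
    extracted.columns = ['text_numerator', 'text_denominator']
    extracted = extracted.apply(pd.to_numeric)

    checked = df.join(extracted)
    # Rows with no rating in the text hold NaN, which never compares equal,
    # so they are flagged as well.
    mismatch = ((checked['text_numerator'] != checked['rating_numerator']) |
                (checked['text_denominator'] != checked['rating_denominator']))
    return checked[mismatch]
```

Calling `find_rating_mismatches(twitter_archive_df)` would list candidate rows for manual review; it does not decide which value is correct.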


**`image_predictions_df` table**
- Missing entries: it has fewer entries than the `twitter_archive_df` table.
- Incorrect datatypes (`p1`, `p2` and `p3` columns).
- Some entries do not have a dog breed.

**`tweet_json_df` table**
- Missing entries: it has fewer entries than the `twitter_archive_df` table.
- Incorrect datatype (`date_time` and `tweet_id` columns).
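
The missing entries noted for the two smaller tables can be listed explicitly instead of only comparing row counts. This is a minimal sketch; the helper name `missing_ids` is hypothetical, and the ids are compared as strings because `tweet_json_df` stores `tweet_id` as text at this point.

```python
import pandas as pd

def missing_ids(reference, other, key='tweet_id'):
    """Return the ids present in `reference` but absent from `other`.

    Hypothetical helper; both frames are assumed to have a `key` column.
    """
    # Compare as strings so that integer and string ids match each other.
    present = reference[key].astype(str).isin(other[key].astype(str))
    return reference.loc[~present, key]
```

For example, `missing_ids(twitter_archive_df, tweet_json_df)` would return the archive tweets for which no JSON record was saved.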

<a id='tidiness'></a>
#### Tidiness
- The *date_time* column from `tweet_json_df` and the *timestamp* column from `twitter_archive_df` have the same content but different column names.
- The dog stages *doggo*, *floofer*, *pupper* and *puppo* are spread across four different columns in the `twitter_archive_df` table.
- Some columns in all three datasets are not needed for analysis.
- All three tables belong in one table.
- The `image_predictions_df` dataframe has three different columns with possible dog breeds.

<a id='clean'></a>
## Clean

In [None]:
twitter_archive_clean = twitter_archive_df.copy()
image_predictions_clean = image_predictions_df.copy()
tweet_json_clean = tweet_json_df.copy()

### `twitter_archive_clean` Table

#### Define

- Delete the retweets. Any row with content in `retweeted_status_id` or `retweeted_status_user_id` is a retweet.

#### Code

In [None]:
# delete retweets
twitter_archive_clean = twitter_archive_clean[pd.isnull(twitter_archive_clean['retweeted_status_user_id'])]

#### Test

In [None]:
twitter_archive_clean.info()

#### Define
- Drop all the unnecessary columns from the table (`in_reply_to_status_id`, `in_reply_to_user_id`, `source`, `retweeted_status_id`, `retweeted_status_user_id`, `retweeted_status_timestamp`, `expanded_urls` and `name`). This also takes care of the missing data in those columns.

#### Code

In [None]:
twitter_archive_clean.drop(['in_reply_to_status_id', 'in_reply_to_user_id', 'source','retweeted_status_id',
                           'retweeted_status_user_id','retweeted_status_timestamp', 'expanded_urls','name'], axis = 1, inplace = True)

#### Test

In [None]:
twitter_archive_clean.info()

#### Define
- Change `timestamp` to datetime and `rating_numerator` to float data types. 

#### Code

In [None]:
# change datatype to datetime
twitter_archive_clean['timestamp'] = pd.to_datetime(twitter_archive_clean['timestamp'])
twitter_archive_clean['rating_numerator'] = twitter_archive_clean['rating_numerator'].astype('float')

#### Test

In [None]:
twitter_archive_clean.info()

#### Define
- Correct all the incorrect numerators to match the value in the text.

#### Code

In [None]:
# correct the whole-number numerators
twitter_archive_clean.loc[(twitter_archive_clean['tweet_id'] == 666287406224695296) & (twitter_archive_clean['rating_numerator'] == 1), ['rating_numerator']] = 9
twitter_archive_clean.loc[(twitter_archive_clean['tweet_id'] == 881633300179243008) & (twitter_archive_clean['rating_numerator'] == 17), ['rating_numerator']] = 13
twitter_archive_clean.loc[(twitter_archive_clean['tweet_id'] == 835246439529840640) & (twitter_archive_clean['rating_numerator'] == 960), ['rating_numerator']] = 13
twitter_archive_clean.loc[(twitter_archive_clean['tweet_id'] == 716439118184652801) & (twitter_archive_clean['rating_numerator'] == 50), ['rating_numerator']] = 11
twitter_archive_clean.loc[(twitter_archive_clean['tweet_id'] == 722974582966214656) & (twitter_archive_clean['rating_numerator'] == 4), ['rating_numerator']] = 13
twitter_archive_clean.loc[(twitter_archive_clean['tweet_id'] == 682962037429899265) & (twitter_archive_clean['rating_numerator'] == 7), ['rating_numerator']] = 10

#change incorrect floats
twitter_archive_clean.loc[(twitter_archive_clean['tweet_id'] == 786709082849828864) & (twitter_archive_clean['rating_numerator'] == 75), ['rating_numerator']] = 9.75
twitter_archive_clean.loc[(twitter_archive_clean['tweet_id'] == 778027034220126208) & (twitter_archive_clean['rating_numerator'] == 27), ['rating_numerator']] = 11.27
twitter_archive_clean.loc[(twitter_archive_clean['tweet_id'] == 680494726643068929) & (twitter_archive_clean['rating_numerator'] == 26), ['rating_numerator']] = 11.26
twitter_archive_clean.loc[(twitter_archive_clean['tweet_id'] == 883482846933004288) & (twitter_archive_clean['rating_numerator'] == 5), ['rating_numerator']] = 13.5
twitter_archive_clean.loc[(twitter_archive_clean['tweet_id'] == 681340665377193984) & (twitter_archive_clean['rating_numerator'] == 5), ['rating_numerator']] = 9.5


# one pic many dogs
twitter_archive_clean.loc[(twitter_archive_clean['tweet_id'] == 731156023742988288) & (twitter_archive_clean['rating_numerator'] == 204), ['rating_numerator']] = 12
twitter_archive_clean.loc[(twitter_archive_clean['tweet_id'] == 758467244762497024) & (twitter_archive_clean['rating_numerator'] == 165), ['rating_numerator']] = 11
twitter_archive_clean.loc[(twitter_archive_clean['tweet_id'] == 677716515794329600) & (twitter_archive_clean['rating_numerator'] == 144), ['rating_numerator']] = 12
twitter_archive_clean.loc[(twitter_archive_clean['tweet_id'] == 684225744407494656) & (twitter_archive_clean['rating_numerator'] == 143), ['rating_numerator']] = 11
twitter_archive_clean.loc[(twitter_archive_clean['tweet_id'] == 684222868335505415) & (twitter_archive_clean['rating_numerator'] == 121), ['rating_numerator']] = 11
twitter_archive_clean.loc[(twitter_archive_clean['tweet_id'] == 713900603437621249) & (twitter_archive_clean['rating_numerator'] == 99), ['rating_numerator']] = 11
twitter_archive_clean.loc[(twitter_archive_clean['tweet_id'] == 675853064436391936) & (twitter_archive_clean['rating_numerator'] == 88), ['rating_numerator']] = 11
twitter_archive_clean.loc[(twitter_archive_clean['tweet_id'] == 820690176645140481) & (twitter_archive_clean['rating_numerator'] == 84), ['rating_numerator']] = 12
twitter_archive_clean.loc[(twitter_archive_clean['tweet_id'] == 710658690886586372) & (twitter_archive_clean['rating_numerator'] == 80), ['rating_numerator']] = 10
twitter_archive_clean.loc[(twitter_archive_clean['tweet_id'] == 704054845121142784) & (twitter_archive_clean['rating_numerator'] == 60), ['rating_numerator']] = 12
twitter_archive_clean.loc[(twitter_archive_clean['tweet_id'] == 709198395643068416) & (twitter_archive_clean['rating_numerator'] == 45), ['rating_numerator']] = 9
twitter_archive_clean.loc[(twitter_archive_clean['tweet_id'] == 697463031882764288) & (twitter_archive_clean['rating_numerator'] == 44), ['rating_numerator']] = 11
twitter_archive_clean.loc[(twitter_archive_clean['tweet_id'] == 686035780142297088) & (twitter_archive_clean['rating_numerator'] == 10), ['rating_numerator']] = 2
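
The corrections above repeat the same `.loc` pattern for every tweet. As a possible alternative, the fixes could be kept in a dictionary keyed by `tweet_id` and applied in one step. The helper `apply_rating_corrections` is hypothetical, and unlike the cell above it does not check the old value before overwriting it.

```python
import pandas as pd

def apply_rating_corrections(df, corrections, column):
    """Return a copy of df where `column` is replaced for the tweet_ids in `corrections`.

    Hypothetical helper: `corrections` maps tweet_id -> corrected value.
    """
    fixed = df.copy()
    # Tweets without a correction map to NaN and keep their current value.
    new_values = fixed['tweet_id'].map(corrections)
    fixed[column] = new_values.fillna(fixed[column])
    return fixed

# Two of the decimal corrections from the cell above, as an example mapping.
example_corrections = {786709082849828864: 9.75, 883482846933004288: 13.5}
```

The same helper could be pointed at `rating_denominator` with a second dictionary, which keeps each tweet's correction on a single short line.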


#### Test

In [None]:
twitter_archive_clean['rating_numerator'].value_counts()

#### Define
- Change the `rating_denominator` of all incorrect entries to 10.

#### Code

In [None]:
# change incorrect denoms
twitter_archive_clean.loc[(twitter_archive_clean['tweet_id'] == 666287406224695296) & (twitter_archive_clean['rating_denominator'] == 2), ['rating_denominator']] = 10
twitter_archive_clean.loc[(twitter_archive_clean['tweet_id'] == 681340665377193984) & (twitter_archive_clean['rating_denominator'] == 5), ['rating_denominator']] = 10
twitter_archive_clean.loc[(twitter_archive_clean['tweet_id'] == 740373189193256964) & (twitter_archive_clean['rating_denominator'] == 11), ['rating_denominator']] = 10
twitter_archive_clean.loc[(twitter_archive_clean['tweet_id'] == 716439118184652801) & (twitter_archive_clean['rating_denominator'] == 50), ['rating_denominator']] = 10
twitter_archive_clean.loc[(twitter_archive_clean['tweet_id'] == 722974582966214656) & (twitter_archive_clean['rating_denominator'] == 20), ['rating_denominator']] = 10
twitter_archive_clean.loc[(twitter_archive_clean['tweet_id'] == 682962037429899265) & (twitter_archive_clean['rating_denominator'] == 11), ['rating_denominator']] = 10


#the one pic many dog denom

twitter_archive_clean.loc[(twitter_archive_clean['tweet_id'] == 731156023742988288) & (twitter_archive_clean['rating_denominator'] == 170), ['rating_denominator']] = 10
twitter_archive_clean.loc[(twitter_archive_clean['tweet_id'] == 758467244762497024) & (twitter_archive_clean['rating_denominator'] == 150), ['rating_denominator']] = 10
twitter_archive_clean.loc[(twitter_archive_clean['tweet_id'] == 677716515794329600) & (twitter_archive_clean['rating_denominator'] == 120), ['rating_denominator']] = 10
twitter_archive_clean.loc[(twitter_archive_clean['tweet_id'] == 684225744407494656) & (twitter_archive_clean['rating_denominator'] == 130), ['rating_denominator']] = 10
twitter_archive_clean.loc[(twitter_archive_clean['tweet_id'] == 684222868335505415) & (twitter_archive_clean['rating_denominator'] == 110), ['rating_denominator']] = 10
twitter_archive_clean.loc[(twitter_archive_clean['tweet_id'] == 713900603437621249) & (twitter_archive_clean['rating_denominator'] == 90), ['rating_denominator']] = 10
twitter_archive_clean.loc[(twitter_archive_clean['tweet_id'] == 675853064436391936) & (twitter_archive_clean['rating_denominator'] == 80), ['rating_denominator']] = 10
twitter_archive_clean.loc[(twitter_archive_clean['tweet_id'] == 820690176645140481) & (twitter_archive_clean['rating_denominator'] == 70), ['rating_denominator']] = 10
twitter_archive_clean.loc[(twitter_archive_clean['tweet_id'] == 710658690886586372) & (twitter_archive_clean['rating_denominator'] == 80), ['rating_denominator']] = 10
twitter_archive_clean.loc[(twitter_archive_clean['tweet_id'] == 704054845121142784) & (twitter_archive_clean['rating_denominator'] == 50), ['rating_denominator']] = 10
twitter_archive_clean.loc[(twitter_archive_clean['tweet_id'] == 709198395643068416) & (twitter_archive_clean['rating_denominator'] == 50), ['rating_denominator']] = 10
twitter_archive_clean.loc[(twitter_archive_clean['tweet_id'] == 697463031882764288) & (twitter_archive_clean['rating_denominator'] == 40), ['rating_denominator']] = 10
twitter_archive_clean.loc[(twitter_archive_clean['tweet_id'] == 835246439529840640) & (twitter_archive_clean['rating_denominator'] == 0), ['rating_denominator']] = 10
twitter_archive_clean.loc[(twitter_archive_clean['tweet_id'] == 686035780142297088) & (twitter_archive_clean['rating_denominator'] == 20), ['rating_denominator']] = 10


#### Test

In [None]:
twitter_archive_clean['rating_denominator'].value_counts()

#### Define

- Drop the rows with the ratings 20/16, 24/7 and 11/15, as they are not proper ratings.

#### Code

In [None]:
#drop the rows with no clear ratings
twitter_archive_clean.drop([342,516,1663], inplace=True)
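
Dropping by hard-coded index labels only works while the index is unchanged. A sketch of a less fragile variant selects the rows by the rating pairs named in the Define step instead. The helper name `drop_rating_pairs` is an assumption made for illustration.

```python
import pandas as pd

def drop_rating_pairs(df, pairs):
    """Return df without the rows whose (numerator, denominator) pair is in `pairs`.

    Hypothetical helper: `pairs` is a set of (numerator, denominator) tuples.
    """
    keys = zip(df['rating_numerator'], df['rating_denominator'])
    keep = pd.Series([key not in pairs for key in keys],
                     index=df.index, dtype=bool)
    return df[keep]

# The three pairs named in the Define step above.
unclear_ratings = {(20, 16), (24, 7), (11, 15)}
```

This assumes no genuine rating shares one of these pairs; if that is uncertain, filtering on `tweet_id` would be the safer choice.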

#### Test

In [None]:
twitter_archive_clean['rating_denominator'].value_counts()

#### Define
- Melt the *'doggo'*, *'floofer'*, *'pupper'* and *'puppo'* columns into one column called *dog_stage*.

#### Code

In [None]:
# melt the columns and create one column
twitter_archive_clean = pd.melt(twitter_archive_clean, id_vars=['tweet_id', 'timestamp', 'text', 'rating_numerator',
                                                               'rating_denominator'],
                               value_vars = ['doggo', 'floofer', 'pupper', 'puppo'], var_name = 'dog_stage')

#drop the value column
twitter_archive_clean = twitter_archive_clean.drop('value', axis=1)
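
Note that melting and then dropping `value` keeps only the column names, so each tweet appears once per stage regardless of its real stage. A sketch of one alternative is shown here for comparison; it is not what this notebook runs. It assumes each stage column holds either the stage name or the string `'None'`, and the helper name `combine_dog_stages` is hypothetical.

```python
import pandas as pd

STAGES = ['doggo', 'floofer', 'pupper', 'puppo']

def combine_dog_stages(df, stages=STAGES):
    """Collapse the four stage columns into a single dog_stage column.

    Assumption: a stage column contains the stage name when it applies
    and some other value (such as the string 'None') when it does not.
    """
    def stage_for_row(row):
        found = [stage for stage in stages if row[stage] == stage]
        # Tweets that mention several stages keep all of them, comma separated.
        return ', '.join(found) if found else None

    out = df.copy()
    out['dog_stage'] = out[stages].apply(stage_for_row, axis=1)
    return out.drop(columns=stages)
```

With this approach the number of rows stays the same, and tweets with no stage end up with a missing `dog_stage` value.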

In [None]:
twitter_archive_clean['dog_stage'].value_counts()

In [None]:
dog_stage_classes = ['doggo', 'floofer', 'pupper', 'puppo']
pd_ver = pd.__version__.split(".")
if (int(pd_ver[0]) > 0) or (int(pd_ver[1]) >= 21): # v0.21 or later
    dclasses = pd.api.types.CategoricalDtype(ordered = False, categories = dog_stage_classes)
    twitter_archive_clean['dog_stage'] = twitter_archive_clean['dog_stage'].astype(dclasses)
else: # pre-v0.21
    twitter_archive_clean['dog_stage'] = twitter_archive_clean['dog_stage'].astype('category', ordered = False,
                                                         categories = dog_stage_classes)

#### Test

In [None]:
twitter_archive_clean.info()

#### Define
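- Drop the *dog_stage* column as the data is incorrect: the melt kept only the column names, not the values, and repeated every tweet once per stage. Remove the repeated rows as well.

#### Code

In [None]:
# assign the result back, otherwise the column is not actually dropped
twitter_archive_clean = twitter_archive_clean.drop(['dog_stage'], axis=1).drop_duplicates()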

#### Test

In [None]:
twitter_archive_clean.info()

### `image_predictions_clean` Table

#### Define
- Delete the entries where there is no dog breed prediction.

#### Code

In [None]:
# dataframe with no dog breed prediction
wrong_breeds = image_predictions_clean[~image_predictions_clean.p1_dog & \
               ~image_predictions_clean.p2_dog & \
               ~image_predictions_clean.p3_dog][['tweet_id', 'p1', 'p1_dog',
                                        'p2', 'p2_dog', 'p3', 'p3_dog']]
# print(wrong_breeds.info())
idx = list(wrong_breeds.index.values)

# drop the missing dogs dataframe
image_predictions_clean.drop(idx , inplace = True)

#### Test

In [None]:
image_predictions_clean.info()

In [None]:
image_predictions_clean

#### Define
- Choose one dog breed out of the three using the highest confidence and save it in a different column.

#### Code

In [None]:
confidence = []
dog_breed = []

# For each image, keep the dog-breed prediction with the highest confidence.
# Rows with no dog prediction were dropped above, so every row has a candidate.
for idx in image_predictions_clean.index:
    row = image_predictions_clean.loc[idx]
    # (confidence, breed) pairs for the predictions that are dogs
    candidates = [(row['p' + n + '_conf'], row['p' + n])
                  for n in ('1', '2', '3') if row['p' + n + '_dog']]
    best_conf, best_breed = max(candidates, key=lambda pair: pair[0])
    confidence.append(best_conf)
    dog_breed.append(best_breed)

# convert the lists to numpy arrays
confidence = np.array(confidence)
dog_breed = np.array(dog_breed)

# all three lengths should match
print(len(confidence))
print(len(dog_breed))
print(len(image_predictions_clean))

# add these arrays to the dataframe as columns
image_predictions_clean['confidence'] = confidence
image_predictions_clean['dog_breed'] = dog_breed

#### Test

In [None]:
image_predictions_clean.info()

In [None]:
image_predictions_clean.head(30)

#### Define
The previous step chose one of the three predictions and saved it in the *dog_breed* column.
- Convert the *dog_breed* column to a categorical datatype.

#### Code

In [None]:
#get the dog_breeds
dog_breed_classes = image_predictions_clean.loc[:, "dog_breed"].to_list()

#remove any duplicates
dog_breed_classes = list(dict.fromkeys(dog_breed_classes))
dog_breed_classes = np.array(dog_breed_classes)
type(dog_breed_classes)

In [None]:
len(dog_breed_classes)

In [None]:
pd_ver = pd.__version__.split(".")
if (int(pd_ver[0]) > 0) or (int(pd_ver[1]) >= 21): # v0.21 or later
    dclasses = pd.api.types.CategoricalDtype(ordered = False, categories = dog_breed_classes)
    image_predictions_clean['dog_breed'] = image_predictions_clean['dog_breed'].astype(dclasses)
else: # pre-v0.21
    image_predictions_clean['dog_breed'] = image_predictions_clean['dog_breed'].astype('category', 
                                                                ordered = False, categories = dog_breed_classes)

#### Test

In [None]:
image_predictions_clean.info()

#### Define
- Drop *p1*, *p2*, *p3*, *p1_conf*, *p2_conf*, *p3_conf*, *p1_dog*, *p2_dog*, *p3_dog*.

#### Code

In [None]:
image_predictions_clean.drop(['p1', 'p2', 'p3', 'p1_conf', 'p2_conf', 'p3_conf', 'p1_dog', 'p2_dog', 'p3_dog'], axis=1, inplace=True)

#### Test

In [None]:
image_predictions_clean.info()

#### Define
- Drop *jpg_url* from the dataset.

#### Code

In [None]:
# drop unnecessary columns
image_predictions_clean.drop(['jpg_url'], axis=1, inplace=True)

#### Test

In [None]:
image_predictions_clean

### `tweet_json_clean` Table

#### Define
- Convert the *tweet_id* column to int and the *date_time* column to datetime.

#### Code

In [None]:
#change datatypes
tweet_json_clean['tweet_id'] = tweet_json_clean['tweet_id'].astype(int)
tweet_json_clean['date_time'] = pd.to_datetime(tweet_json_clean['date_time'])

#### Test

In [None]:
tweet_json_clean.info()

#### Define 
- Rename the *date_time* column of `tweet_json_clean` to *timestamp* so that it matches `twitter_archive_clean`.

#### Code

In [None]:
# rename the dataframe column
tweet_json_clean = tweet_json_clean.rename(columns={"date_time":"timestamp"})

#### Test

In [None]:
tweet_json_clean.info()

> After cleaning each dataset individually, we combine all the datasets into one.

#### Define
- Join all three tables into one. First combine `twitter_archive_clean` and `tweet_json_clean`, then combine the result with `image_predictions_clean`.

#### Code

In [None]:
# merge twitter_archive_clean and tweet_json_clean
df1 = pd.merge(twitter_archive_clean, tweet_json_clean, on = ['tweet_id', 'timestamp'], how = 'left')

In [None]:
# merge the new df with image_predictions_clean
rate_dogs_df = pd.merge(df1, image_predictions_clean, on=['tweet_id'], how = 'left')
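
Because both merges are left joins, rows without a match silently receive null values. A small sketch of how the number of unmatched rows could be checked uses the `indicator` option of `pd.merge`. The helper name `left_merge_report` is hypothetical.

```python
import pandas as pd

def left_merge_report(left, right, on):
    """Left-merge two frames and count the left rows that found no match.

    Hypothetical helper: returns (merged_frame, unmatched_count).
    """
    # indicator=True adds a _merge column with left_only / both / right_only.
    merged = pd.merge(left, right, on=on, how='left', indicator=True)
    unmatched = int((merged['_merge'] == 'left_only').sum())
    return merged.drop(columns='_merge'), unmatched
```

For instance, `left_merge_report(df1, image_predictions_clean, on=['tweet_id'])` would show how many tweets have no breed prediction before the null rows are dropped.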

#### Test

In [None]:
rate_dogs_df.head()

#### Define
- Drop any null rows in the dataframe.

#### Code

In [None]:
# drop rows with null values
rate_dogs_df.dropna(axis=0, how = 'any', inplace = True )

#### Test

In [None]:
rate_dogs_df

In [None]:
# Save the cleaned dataset as a csv file
# index=False keeps the old row index out of the file
rate_dogs_df.to_csv('twitter_archive_master.csv', index=False)

<a id='analyse'></a>
## Analysis and Visualization

After assessing and cleaning our dataset, we can go ahead and analyse it. Some of the questions we want to use to assess the dataset are:

- What are the most common breeds rated by WeRateDogs? Top 10.
- Is there a pair-wise relationship between retweets, favorites and the ratings given by the account?
- What is the relationship between retweets and favorites?
- What is the relationship between ratings and retweets?
- What is the relationship between ratings and favorites?
- Is there a relationship between number of retweets, favorites and the ratings given by the account?
- What are the most common ratings given by the Twitter account?

We are therefore going to do some exploratory visual analysis to try to answer these questions.

### Q1: What are the most common dog breeds rated by WeRateDogs? Top 10.

In [None]:
# Create a count function
def counter(column):
    
    #join all the entries in the column to a single string
    concat_str = rate_dogs_df[column].str.cat(sep='|')
    
    # Change the string to series
    sep_str = pd.Series(concat_str.split(sep='|'))
    
    return sep_str

In [None]:
# Store the series in a dataframe 
dog_breed_df = counter('dog_breed').to_frame('dog_breed')

#Get the unique values
dog_breed_df = dog_breed_df['dog_breed'].value_counts().to_frame('dog_breed').reset_index()

#rename columns
dog_breed_df = dog_breed_df.rename(columns={'index':'dog_breed','dog_breed':'dog_num'})
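
The counting step above joins and re-splits the breed names before counting them. As a shorter sketch, `value_counts` can count the column directly. The helper name `top_values` is hypothetical, and the output column names are chosen to match the `dog_breed` and `dog_num` columns used in the plot below.

```python
import pandas as pd

def top_values(series, n=10):
    """Return the n most frequent values of `series` with their counts.

    Hypothetical helper: the result has the columns dog_breed and dog_num.
    """
    # value_counts sorts by frequency, most common first.
    counts = series.value_counts().head(n)
    return counts.rename_axis('dog_breed').reset_index(name='dog_num')
```

`top_values(rate_dogs_df['dog_breed'])` would then give the ten most common breeds directly.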

In [None]:
#Cut the first 10 dog breeds
new_dog_breed = dog_breed_df.iloc[:10]

#Set color for the bar graph
base_color = sns.color_palette()[0]
fig_dims = (20, 15)
fig, ax = plt.subplots(figsize = fig_dims)

#Plot the bargraph
sns.barplot(data = new_dog_breed, x = 'dog_num', y = 'dog_breed', color = base_color, ax = ax);

#Set the title and axes
plt.title("Top 10 Most Common Dog Breeds", fontsize=16);
plt.xlabel("Total Number Of Dogs",fontsize=15);
plt.ylabel("Dog Breeds",fontsize= 15);
plt.savefig('dbreed.png');

### Q2: Is there a pair-wise relationship between retweets, likes, rating?

In [None]:
sns.pairplot(rate_dogs_df, vars = ['rating_numerator', 'retweets', 'favorites'],
            plot_kws = {'alpha' : 1/5},diag_kind = 'kde', height = 3);

### Q3: What is the relationship between retweets and favorites?

In [None]:
#set the scales
fig_dims = (20, 15)
fig, ax = plt.subplots(figsize = fig_dims)
plt.xscale('log');
plt.yscale('log');
#plot 
plt.scatter(data = rate_dogs_df, x = 'favorites', y = 'retweets' );
plt.title('Retweets vs Favorites ');
plt.xlabel('Favorites (log)');
plt.ylabel('Retweets (log)');
plt.savefig('retwfav.png');

### Q4: What is the relationship between ratings and retweets?

In [None]:
#set the scales
fig_dims = (20, 15)
fig, ax = plt.subplots(figsize = fig_dims)
plt.xscale('log');

#plot 
sns.regplot(data = rate_dogs_df, x = 'retweets', y = 'rating_numerator', fit_reg = False, 
          x_jitter=0.2, y_jitter=0.1, scatter_kws = {'alpha': 1/3});

plt.axhline(y = 9, color = 'r');
plt.title('Retweets vs Ratings ');
plt.xlabel('Retweets (log)');
plt.ylabel('Ratings');
plt.savefig('rr.png')

### Q5: What is the relationship between ratings and favorites?

In [None]:
#set the scales
fig_dims = (20, 15)
fig, ax = plt.subplots(figsize = fig_dims)
plt.xscale('log');

#plot 
sns.regplot(data = rate_dogs_df, x = 'favorites', y = 'rating_numerator', fit_reg = False, 
          x_jitter=0.2, y_jitter=0.1, scatter_kws = {'alpha': 1/3});
plt.axhline(y = 9, color = 'r');
plt.title('Favorites vs Ratings ');
plt.xlabel('Favorites (log)');
plt.ylabel('Ratings');
plt.savefig('rf.png');

### Q6: Is there a relationship between number of retweets, likes and the ratings given by the account?

In [None]:
#Set dimension size
fig_dims = (20, 15)
fig, ax = plt.subplots(figsize = fig_dims)

plt.xscale('log')
plt.yscale('log')
plt.xlim(10 , 1000000)
plt.ylim(10, 1000000)

#plot 
plt.scatter(data = rate_dogs_df, x = 'retweets', y = 'favorites', c = 'rating_numerator', cmap = 'viridis_r');
plt.colorbar();
plt.savefig('rrf.png')

### Q7: What are the most common ratings given by the Twitter account?

In [None]:
fig_dims = (20, 15)
fig, ax = plt.subplots(figsize = fig_dims)

sns.set_style("whitegrid")
ax = sns.boxplot(y =rate_dogs_df['rating_numerator'])
plt.savefig('b_rating.png')

In [None]:
#use histogram
#set size
fig_dims = (20, 15)
fig, ax = plt.subplots(figsize = fig_dims)

bins = np.arange(0, rate_dogs_df['rating_numerator'].max()+1, 1);
plt.hist(data = rate_dogs_df, x = 'rating_numerator', bins = bins);
plt.savefig('h_rating.png')

<a id='insight'></a>
## Conclusions

After analysing our cleaned dataset, we drew the following conclusions:
- Retweets and favorites have a strong positive correlation: as the number of retweets increases, so does the number of favorites.
- Dogs with high ratings tend to have more retweets and favorites.
- Highly rated dogs have a high number of retweets.
- Highly rated dogs have a high number of favorites.
- The rating given to the highest number of dogs is 13.
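
The correlation claims above rest on the plots alone. A coefficient could back them with a number. This is a rough sketch that computes the Pearson correlation on log-scaled counts, matching the log axes used in the scatter plots. The helper name `log_correlation` is hypothetical, and no result is reported here because the notebook was not run.

```python
import numpy as np
import pandas as pd

def log_correlation(df, x='retweets', y='favorites'):
    """Pearson correlation between log10(x) and log10(y).

    Hypothetical helper: rows where either count is zero or negative
    are skipped because their logarithm is undefined.
    """
    positive = df[(df[x] > 0) & (df[y] > 0)]
    return np.log10(positive[x]).corr(np.log10(positive[y]))
```

A value close to 1 from `log_correlation(rate_dogs_df)` would support the first conclusion. The same call with `x='rating_numerator'` could be used to check the rating claims.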

<a id='ref'></a>
## Reference

[Pandas Documentation](https://pandas.pydata.org/docs/user_guide/)

[Numpy Documentation](https://numpy.org/doc/stable/user/)

[Seaborn Documentation](https://seaborn.pydata.org/tutorial.html)

[Matplotlib Documentation](https://matplotlib.org/3.2.1/tutorials/index.html)

[Github](https://github.com/stephanderton/We-Rate-Dogs-Data-Wrangling-Project)
