# Project: Wrangle and Analyze Data

## Data Gathering
In the cell below, gather **all** three pieces of data for this project and load them in the notebook. **Note:** the method required to gather each dataset is different.
1. Directly download the WeRateDogs Twitter archive data (twitter_archive_enhanced.csv)

In [1]:
import pandas as pd
import numpy as np

df_twitter = pd.read_csv('twitter-archive-enhanced.csv')
df_twitter.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,


2. Use the Requests library to download the tweet image prediction (image_predictions.tsv)

In [2]:
import requests
import os

#download the image predictions only if the file is not already present
if not os.path.exists('image-predictions.tsv'):
    #using the requests library to download the tweet image prediction
    url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'

    response = requests.get(url)
    #stop here if the download failed instead of saving an error page
    response.raise_for_status()

    with open('image-predictions.tsv', 'wb') as file:
        file.write(response.content)
        
#reading the data

df_breed = pd.read_csv('image-predictions.tsv', sep='\t')

df_breed.head()


Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True


3. Use the Tweepy library to query additional data via the Twitter API (tweet_json.txt)

***The following block is instructor-provided code. I could not get the Twitter (now X) API to work: when I run the code, it does not return the correct JSON data and every request fails. I therefore uploaded the JSON data provided with the instructor code.***

In [None]:
import tweepy
from tweepy import OAuthHandler
import json
from timeit import default_timer as timer

# Query Twitter API for each tweet in the Twitter archive and save JSON in a text file
# These are hidden to comply with Twitter's API terms and conditions
consumer_key = 'HIDDEN'
consumer_secret = 'HIDDEN'
access_token = 'HIDDEN'
access_secret = 'HIDDEN'

auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth, wait_on_rate_limit=True)

# NOTE TO STUDENT WITH MOBILE VERIFICATION ISSUES:
# df_1 is a DataFrame with the twitter_archive_enhanced.csv file. You may have to
# change line 17 to match the name of your DataFrame with twitter_archive_enhanced.csv
# NOTE TO REVIEWER: this student had mobile verification issues so the following
# Twitter API code was sent to this student from a Udacity instructor
# Tweet IDs for which to gather additional data via Twitter's API
tweet_ids = df_twitter.tweet_id.values
len(tweet_ids)

# Query Twitter's API for JSON data for each tweet ID in the Twitter archive
count = 0
fails_dict = {}
start = timer()
# Save each tweet's returned JSON as a new line in a .txt file
with open('tweet__json.txt', 'w') as outfile:
    # This loop will likely take 20-30 minutes to run because of Twitter's rate limit
    for tweet_id in tweet_ids:
        count += 1
        print(str(count) + ": " + str(tweet_id))
        try:
            tweet = api.get_status(tweet_id, tweet_mode='extended')
            print("Success")
            json.dump(tweet._json, outfile)
            outfile.write('\n')
        except tweepy.TweepError as e:
            print("Fail")
            fails_dict[tweet_id] = e
            pass
end = timer()
print(end - start)
print(fails_dict)


In [3]:
import json
#Avoid reparsing
if 'parsed_data.csv' in os.listdir():
    print('Data has been parsed')
    df_parse = pd.read_csv('parsed_data.csv')
else:
    with open('tweet-json.txt', 'r') as file:
        tweet_jsons = file.readlines()
    tweet_data = []
    
    for tweet in tweet_jsons:
        jl = json.loads(tweet)
        full_text = jl['full_text']
        
        if not full_text.startswith(('RT','@')):
            data = {'tweet_id' : jl['id_str'],
                    'full_text' : full_text,
                   'retweet_count' : jl['retweet_count'],
                   'favorite_count' : jl['favorite_count'],
                   'created' : pd.to_datetime(jl['created_at'])}
            tweet_data.append(data)
            
    df_parse = pd.DataFrame(tweet_data)
    #cache the parsed result so that later runs can skip the parsing step
    df_parse.to_csv('parsed_data.csv', index=False)
df_parse.head()

Data has been parsed


Unnamed: 0,created,favorite_count,full_text,retweet_count,tweet_id
0,2017-08-01 16:23:56,39467,This is Phineas. He's a mystical boy. Only eve...,8853,892420643555336193
1,2017-08-01 00:17:27,33819,This is Tilly. She's just checking pup on you....,6514,892177421306343426
2,2017-07-31 00:18:03,25461,This is Archie. He is a rare Norwegian Pouncin...,4328,891815181378084864
3,2017-07-30 15:58:51,42908,This is Darla. She commenced a snooze mid meal...,8964,891689557279858688
4,2017-07-29 16:00:24,41048,This is Franklin. He would like you to stop ca...,9774,891327558926688256
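
The parsing step above can also be written as a small reusable function. The following is only a sketch: the function name `parse_tweet_lines` and the sample lines are made up for illustration, and it keeps the same text-prefix filter as the cell above while additionally skipping blank lines.

```python
import json

def parse_tweet_lines(lines):
    """Turn an iterable of JSON strings (one tweet per line) into flat records."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            # tolerate blank lines, e.g. a trailing newline at the end of the file
            continue
        tweet = json.loads(line)
        full_text = tweet['full_text']
        # same rule as the notebook cell: skip texts that start with 'RT' or '@'
        if full_text.startswith(('RT', '@')):
            continue
        records.append({
            'tweet_id': tweet['id_str'],
            'full_text': full_text,
            'retweet_count': tweet['retweet_count'],
            'favorite_count': tweet['favorite_count'],
        })
    return records

# hypothetical sample lines, not taken from tweet-json.txt
sample_lines = [
    '{"id_str": "1", "full_text": "This is Rex. 12/10", "retweet_count": 5, "favorite_count": 9}',
    '',
    '{"id_str": "2", "full_text": "RT @dog_rates: This is Rex. 12/10", "retweet_count": 5, "favorite_count": 0}',
]
records = parse_tweet_lines(sample_lines)
print(records)
```

Passing the open file object directly (`parse_tweet_lines(file)`) would avoid reading all lines into memory first.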


## Assessing Data
In this section, detect and document at least **eight (8) quality issues and two (2) tidiness issues**. You must use **both** visual assessment and programmatic assessment to assess the data.

**Note:** pay attention to the following key points when you assess the data.

* You only want original ratings (no retweets) that have images. Though there are 5000+ tweets in the dataset, not all are dog ratings and some are retweets.
* Assessing and cleaning the entire dataset completely would require a lot of time, and is not necessary to practice and demonstrate your skills in data wrangling. Therefore, the requirements of this project are only to assess and clean at least 8 quality issues and at least 2 tidiness issues in this dataset.
* The fact that the rating numerators are greater than the denominators does not need to be cleaned. This [unique rating system](http://knowyourmeme.com/memes/theyre-good-dogs-brent) is a big part of the popularity of WeRateDogs.
* You do not need to gather the tweets beyond August 1st, 2017. You can, but note that you won't be able to gather the image predictions for these tweets since you don't have access to the algorithm used.
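
The first key point (original ratings that have images) can be checked programmatically. The sketch below uses two small made-up frames, `archive` and `predictions`, in place of the real data. It assumes that a tweet is original when both `retweeted_status_id` and `in_reply_to_status_id` are empty, and that a tweet has an image when its `tweet_id` appears in the image predictions.

```python
import pandas as pd

# toy stand-ins for the archive and the image predictions (values are made up)
archive = pd.DataFrame({
    'tweet_id': [1, 2, 3, 4],
    'retweeted_status_id': [None, 10.0, None, None],    # tweet 2 is a retweet
    'in_reply_to_status_id': [None, None, 20.0, None],  # tweet 3 is a reply
})
predictions = pd.DataFrame({'tweet_id': [1, 2, 3]})     # tweet 4 has no image

# original = neither a retweet nor a reply
is_original = (archive['retweeted_status_id'].isnull()
               & archive['in_reply_to_status_id'].isnull())
# has an image = present in the image predictions
has_image = archive['tweet_id'].isin(predictions['tweet_id'])

originals_with_images = archive[is_original & has_image]
print(originals_with_images['tweet_id'].tolist())
```

Counting `is_original.sum()` and `has_image.sum()` separately shows how many rows each condition removes.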



In [4]:
#checking for missing data
df_twitter.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), ob

In [5]:
#checking for duplicated tweets
duplicate_twitter = df_twitter.duplicated(subset=['tweet_id']).sum()
print(duplicate_twitter)

0


In [6]:
#checking for missing data
df_breed.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
tweet_id    2075 non-null int64
jpg_url     2075 non-null object
img_num     2075 non-null int64
p1          2075 non-null object
p1_conf     2075 non-null float64
p1_dog      2075 non-null bool
p2          2075 non-null object
p2_conf     2075 non-null float64
p2_dog      2075 non-null bool
p3          2075 non-null object
p3_conf     2075 non-null float64
p3_dog      2075 non-null bool
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


In [7]:
duplicate_breed = df_breed.duplicated(subset=['tweet_id']).sum()
print(duplicate_breed)

0


In [8]:
#Counting predictions with p1_conf below 51% and tweets that have no dog name


low_confidence = df_breed[df_breed['p1_conf'] < 0.51]

no_name = df_twitter[df_twitter['name'].isnull() | (df_twitter['name'] == '') |
                    (df_twitter['name'] == 'None')]

num_low = low_confidence.shape[0]
num_no_name = no_name.shape[0]

print('Confidence level below 51%', num_low)
print('Posts with no dogs name', num_no_name)


Confidence level below 51% 869
Posts with no dogs name 745


In [9]:
df_parse.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2150 entries, 0 to 2149
Data columns (total 5 columns):
created           2150 non-null object
favorite_count    2150 non-null int64
full_text         2150 non-null object
retweet_count     2150 non-null int64
tweet_id          2150 non-null int64
dtypes: int64(3), object(2)
memory usage: 84.1+ KB


In [10]:
duplicate_parse = df_parse.duplicated(subset=['tweet_id']).sum()
print(duplicate_parse)

0


In [11]:
import matplotlib.pyplot as plt

plt.figure(figsize=(10,6))
plt.scatter(df_twitter['rating_numerator'],
           df_twitter['rating_denominator'])
plt.xlabel('Rating Numerator')
plt.ylabel('Rating Denominator')
#highlighting a denominator of 10
plt.axhline(y=10, color='red', linewidth=2)
#highlighting a numerator of 25
plt.axvline(x=25, color='blue', linewidth=2)
plt.show()

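[Figure: scatter plot of Rating Numerator (x-axis) against Rating Denominator (y-axis), with a red horizontal line at denominator 10 and a blue vertical line at numerator 25]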

### Quality issues
1. Irrelevant columns exist in both df_twitter and df_breed.

2. Predictions with a confidence rating below 51% are too unreliable to use.

3. Incorrect data types: tweet_id needs to be a string (object) and timestamp needs to be a datetime.

4. Some ratings have a numerator that is too high (above 25) or a denominator that does not equal 10.

5. Retweets and replies (@'s) need to be removed so that only original posts are included.

6. Missing images: any post with no image needs to be removed.

7. Only tweets up to August 1, 2017 should be included.

8. Posts that do not include the dog's name should be removed.
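
Issue 7 can be checked with a timestamp comparison. This is a minimal sketch on made-up rows, assuming the timestamps are in UTC (as the `+0000` suffix in the archive suggests) and that tweets posted on August 1, 2017 itself are still kept.

```python
import pandas as pd

# toy rows in the same timestamp format as the archive (values are made up)
toy = pd.DataFrame({
    'tweet_id': ['1', '2', '3'],
    'timestamp': ['2017-07-31 00:18:03 +0000',
                  '2017-08-01 16:23:56 +0000',
                  '2017-08-02 09:00:00 +0000'],
})

# parse as timezone-aware UTC datetimes
toy['timestamp'] = pd.to_datetime(toy['timestamp'], utc=True)

# keep everything strictly before August 2, i.e. up to and including August 1
cutoff = pd.Timestamp('2017-08-02', tz='UTC')
kept = toy[toy['timestamp'] < cutoff]
print(kept['tweet_id'].tolist())
```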

### Tidiness issues
1. The dog stage is spread across multiple columns; these should be combined into one column.

2. The data is split across several data frames; they should be merged for ease of use.

## Cleaning Data
In this section, clean **all** of the issues you documented while assessing. 

**Note:** Make a copy of the original data before cleaning. Cleaning includes merging individual pieces of data according to the rules of [tidy data](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html). The result should be a high-quality and tidy master pandas DataFrame (or DataFrames, if appropriate).

In [12]:
# Make copies of original pieces of data
df_twitter_copy = df_twitter.copy()
df_breed_copy = df_breed.copy()
df_parse_copy = df_parse.copy()

df_twitter_copy.head()


Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,


In [13]:
df_breed_copy.head()


Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True


In [14]:

df_parse_copy.head()

Unnamed: 0,created,favorite_count,full_text,retweet_count,tweet_id
0,2017-08-01 16:23:56,39467,This is Phineas. He's a mystical boy. Only eve...,8853,892420643555336193
1,2017-08-01 00:17:27,33819,This is Tilly. She's just checking pup on you....,6514,892177421306343426
2,2017-07-31 00:18:03,25461,This is Archie. He is a rare Norwegian Pouncin...,4328,891815181378084864
3,2017-07-30 15:58:51,42908,This is Darla. She commenced a snooze mid meal...,8964,891689557279858688
4,2017-07-29 16:00:24,41048,This is Franklin. He would like you to stop ca...,9774,891327558926688256


### Issue #1: Multiple columns for dog stage.

#### Define: Tidiness issue. The four dog stage columns (doggo, floofer, pupper, puppo) need to be combined into a single dog_stage column.

#### Code

In [15]:
#cleaning the columns for each dog stage before combining
stage_columns = ['doggo', 'floofer', 'pupper', 'puppo']
for column in stage_columns:
    #assigning the result back instead of calling replace(inplace=True) on a column selection
    df_twitter_copy[column] = df_twitter_copy[column].replace('None', '')

#combining the dog stage columns into a single column "dog_stage"
df_twitter_copy['dog_stage'] = df_twitter_copy['doggo'] + df_twitter_copy['floofer'] + df_twitter_copy['pupper'] + df_twitter_copy['puppo']

#formatting entries with multiple stages
df_twitter_copy.loc[df_twitter_copy.dog_stage == 'doggopupper', 'dog_stage'] = 'doggo,pupper'
df_twitter_copy.loc[df_twitter_copy.dog_stage == 'doggofloofer', 'dog_stage'] = 'doggo,floofer'
df_twitter_copy.loc[df_twitter_copy.dog_stage == 'doggopuppo', 'dog_stage'] = 'doggo,puppo'
df_twitter_copy.loc[df_twitter_copy.dog_stage == 'flooferpupper', 'dog_stage'] = 'floofer,pupper'
df_twitter_copy.loc[df_twitter_copy.dog_stage == 'flooferpuppo', 'dog_stage'] = 'floofer,puppo'
df_twitter_copy.loc[df_twitter_copy.dog_stage == 'pupperpuppo', 'dog_stage'] = 'pupper,puppo'
#removing individual dog stage columns
df_twitter_copy.drop(stage_columns, axis=1, inplace=True)

#displaying rows to verify
df_twitter_copy.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,dog_stage
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,


#### Test

In [16]:
assert 'dog_stage' in df_twitter_copy.columns, 'dog_stage column not found'
assert 'doggo' not in df_twitter_copy.columns, 'doggo column still present'
assert 'floofer' not in df_twitter_copy.columns, 'floofer column still present'
assert 'pupper' not in df_twitter_copy.columns, 'pupper column still present'
assert 'puppo' not in df_twitter_copy.columns, 'puppo column still present'
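
The hard-coded combinations in the cleaning cell could also be produced generically by joining whichever stage values are present in a row. The sketch below runs on a made-up frame `toy`; the helper name `combine_stages` is hypothetical and not part of the notebook.

```python
import pandas as pd

stage_columns = ['doggo', 'floofer', 'pupper', 'puppo']

# toy rows using the same 'None' placeholder as the archive (values are made up)
toy = pd.DataFrame({
    'doggo':   ['None', 'doggo', 'doggo', 'None'],
    'floofer': ['None', 'None', 'None', 'None'],
    'pupper':  ['None', 'None', 'pupper', 'pupper'],
    'puppo':   ['None', 'None', 'None', 'None'],
})

def combine_stages(row):
    # keep only the stages that are actually set, then join them with commas
    stages = [value for value in row if value != 'None']
    return ','.join(stages)

toy['dog_stage'] = toy[stage_columns].apply(combine_stages, axis=1)
print(toy['dog_stage'].tolist())  # ['', 'doggo', 'doggo,pupper', 'pupper']
```

Because the join works on whatever values are present, it also covers rows with three or more stages, which the six explicit replacements do not.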

### Issue #2: Irrelevant Columns

#### Define: Irrelevant columns exist in the data frames.

#### Code

In [None]:
df_twitter_copy.head()

In [None]:
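#removing retweets and replies: drops every tweet whose text contains 'RT' or '@' (case-insensitive)
#.copy() makes the filtered frame independent so the in-place drops below act on it directly
df_twitter_copy = df_twitter_copy[~df_twitter_copy['text'].str.contains('RT|@', case=False)].copy()
#dropping the columns that are not needed for the analysis
df_twitter_copy.drop(['in_reply_to_status_id', 'in_reply_to_user_id', 'source', 'text', 'retweeted_status_id', 'retweeted_status_user_id'
                     , 'retweeted_status_timestamp'], axis = 1, inplace=True)
#removing posts that have no expanded URL (used here as the missing-image check)
df_twitter_copy.drop(df_twitter_copy[df_twitter_copy['expanded_urls'].isnull()].index, inplace=True)
df_twitter_copy.head()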
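
The text filter in this cell matches 'RT' or '@' anywhere in the tweet, ignoring case, so it can also drop original tweets that merely contain the letters "rt" or mention another account. The sketch below compares that rule with a stricter prefix rule on three made-up texts. It uses the standard `re` module to roughly mirror what `str.contains` does; `prefix_pattern` is a suggested alternative, not the notebook's method.

```python
import re

# made-up example texts
texts = [
    'RT @dog_rates: This is Rex. 12/10',          # retweet
    '@someone thanks, 13/10 would pet',           # reply
    'This is Rex. He is a smart pupper. 12/10',   # original; "smart" contains "rt"
]

# the rule used in the cell: 'RT' or '@' anywhere, case-insensitive
substring_pattern = re.compile(r'RT|@', re.IGNORECASE)
# stricter alternative: only at the start of the text, 'RT' as a whole word
prefix_pattern = re.compile(r'^(?:RT\b|@)')

substring_flags = [bool(substring_pattern.search(text)) for text in texts]
prefix_flags = [bool(prefix_pattern.search(text)) for text in texts]

print(substring_flags)  # [True, True, True]
print(prefix_flags)     # [True, True, False]
```

Filtering on empty `retweeted_status_id` and `in_reply_to_status_id` values before those columns are dropped would avoid text matching altogether.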

In [None]:
#removing posts that do not include the dog's name
df_twitter_copy.drop(df_twitter_copy[df_twitter_copy['name'].isnull() | (df_twitter_copy['name'] == '')
                                    | (df_twitter_copy['name'] == 'None')].index, inplace=True)
df_twitter_copy.head()

In [None]:
#removing ratings with a numerator above 25 or a denominator other than 10
df_twitter_copy.drop(df_twitter_copy[df_twitter_copy['rating_numerator'] > 25].index, inplace=True)
df_twitter_copy.drop(df_twitter_copy[df_twitter_copy['rating_denominator'] != 10].index, inplace=True)
df_twitter_copy.head()

In [None]:
#converting data type
df_twitter_copy['timestamp'] = pd.to_datetime(df_twitter_copy['timestamp'])
df_twitter_copy.info()


In [None]:
#converting data type
df_parse_copy['created'] = pd.to_datetime(df_parse_copy['created'])
df_parse_copy.info()


In [None]:
df_breed_copy.head()

In [None]:
#Dropping irrelevant data
df_breed_copy.drop(['img_num', 'p2', 'p2_conf', 'p2_dog','p3','p3_conf','p3_dog'], axis=1, inplace=True)

#Improving readability
df_breed_copy['breed'] = df_breed_copy['p1']
df_breed_copy['confidence_level'] = df_breed_copy['p1_conf']
#dropping old column names
df_breed_copy.drop(['p1', 'p1_conf'],axis=1,inplace=True)

#converting confidence level from a decimal to a percentage
df_breed_copy['confidence_level'] = (df_breed_copy['confidence_level'] *100).round(2)
#filtering out the data that is unreliable (below 50%)
df_breed_copy.drop(df_breed_copy[df_breed_copy['confidence_level'] < 50].index, inplace=True)

#dropping if not a dog
df_breed_copy.drop(df_breed_copy[df_breed_copy['p1_dog'] == False].index, inplace=True)

df_breed_copy.head()

In [None]:
#converting tweet_id into a string
df_parse_copy['tweet_id'] = df_parse_copy['tweet_id'].astype(str)
df_twitter_copy['tweet_id'] = df_twitter_copy['tweet_id'].astype(str)
df_breed_copy['tweet_id'] = df_breed_copy['tweet_id'].astype(str)

In [None]:
df_parse_copy.info()
df_twitter_copy.info()
df_breed_copy.info()

#### Test

In [None]:
#Unneeded columns
assert 'in_reply_to_status_id' not in df_twitter_copy.columns, 'in_reply_to_status_id still present'
assert 'in_reply_to_user_id' not in df_twitter_copy.columns, 'in_reply_to_user_id still present'
assert 'retweeted_status_id' not in df_twitter_copy.columns, 'retweeted_status_id still present'
assert 'retweeted_status_user_id' not in df_twitter_copy.columns, 'retweeted_status_user_id still present'
assert 'retweeted_status_timestamp' not in df_twitter_copy.columns, 'retweeted_status_timestamp still present'
#Confidence Rating Too Low
assert (df_breed_copy['confidence_level'] >= 50).all(), 'Confidence level below 50% found'
#All entries are dogs
assert (df_breed_copy['p1_dog'] ==True).all(), 'Entries are found that are not dogs.'
#verifying data types (timestamps may be timezone-aware, so any datetime64 dtype is accepted)
assert pd.api.types.is_datetime64_any_dtype(df_parse_copy['created']), 'Created column not converted to datetime'
assert pd.api.types.is_datetime64_any_dtype(df_twitter_copy['timestamp']), 'Timestamp column not converted to datetime'
assert df_parse_copy['tweet_id'].dtype == 'object','tweet_id column not converted to string in parse data frame'
assert df_twitter_copy['tweet_id'].dtype == 'object','tweet_id column not converted to string in twitter data frame'
assert df_breed_copy['tweet_id'].dtype == 'object','tweet_id column not converted to string in breed data frame'
#verifying that all records have the dogs names
assert not df_twitter_copy['name'].isin(['', 'None']).any(), 'Rows with empty or "None" names still present'


## Storing Data
Save gathered, assessed, and cleaned master dataset to a CSV file named "twitter_archive_master.csv".

In [None]:
#merging the data frames and saving the master_df into 'twitter_archive_master.csv'
master_df = pd.merge(df_parse_copy, df_twitter_copy, on='tweet_id', how='inner')
master_df = pd.merge(master_df, df_breed_copy, on= 'tweet_id', how='inner')
master_df.head()
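
An inner merge silently drops every tweet that is missing from one of the frames. One way to see how many rows are lost is to run an outer merge with `indicator=True` first, as in the sketch below. The frames `left` and `right` are made-up stand-ins, and `validate='one_to_one'` is an extra check that assumes each `tweet_id` appears at most once per frame.

```python
import pandas as pd

# toy stand-ins for two of the cleaned frames (values are made up)
left = pd.DataFrame({'tweet_id': ['1', '2', '3'], 'retweet_count': [5, 7, 9]})
right = pd.DataFrame({'tweet_id': ['2', '3', '4'], 'breed': ['pug', 'collie', 'redbone']})

# outer merge keeps unmatched rows and labels where each row came from
audit = pd.merge(left, right, on='tweet_id', how='outer',
                 validate='one_to_one', indicator=True)
counts = audit['_merge'].value_counts()
print(counts)

# rows present in both frames are exactly what an inner merge would keep
merged = audit[audit['_merge'] == 'both'].drop(columns='_merge')
print(sorted(merged['tweet_id']))
```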

In [None]:
#saving master_df to csv
master_df.to_csv('twitter_archive_master.csv', index=False)

## Analyzing and Visualizing Data
In this section, analyze and visualize your wrangled data. You must produce at least **three (3) insights and one (1) visualization.**

In [None]:
#Insight 1: the distribution of retweet count by dog breed.
#I want to see whether retweet counts differ between breeds.
filtered_df = master_df.dropna(subset=['breed', 'retweet_count'])

breed_summary = filtered_df.groupby('breed')['retweet_count'].describe()

#sorted
sorted_summary = breed_summary.sort_values(by='mean',ascending=False)

sorted_summary.head(3)


In [None]:
sorted_summary.tail(3)

In [None]:
filtered_df2 = master_df[['breed', 'favorite_count']].copy()
mean=filtered_df2.groupby('breed')['favorite_count'].mean()
sorted_mean= mean.sort_values(ascending=False)
sorted_mean.head(3)

In [None]:
sorted_mean.tail(3)

In [None]:
most_popular_names = master_df['name'].value_counts().head(10)
most_popular_names

In [None]:
names_with_a = master_df[master_df['name'] == 'a']
names_with_a

In [None]:
master_df = master_df[master_df['name'] != 'a']
master_df = master_df[master_df['name'] != 'the']

In [None]:
most_popular_names = master_df['name'].value_counts().head(10)
most_popular_names

### Insights:
1. I wanted to see the distribution of retweet count by dog breed and identify the top 3 and bottom 3 breeds. By mean retweet count, the 3 most popular breeds were the Briard, the Irish Setter, and the Irish Water Spaniel, while the 3 least retweeted breeds were the Tibetan Terrier, the Redbone, and the Black and Tan Coonhound.

2. The second insight I wanted to check was whether the previous ranking would match the favorites. The top 3 favorited breeds changed somewhat, but the bottom 3 breeds were the same: the Tibetan Terrier, the Redbone, and the Black and Tan Coonhound.

3. The last insight I wanted to look at was the most popular dog names. While looking at this data I noticed multiple entries where the dog's name was "a" or "the". These are most likely typos or placeholders rather than real names. I displayed the records to double-check and then dropped them.
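
The placeholder names from insight 3 can also be caught with a general rule instead of listing each one. The sketch below relies on a heuristic that is only an assumption: real dog names start with a capital letter, so lowercase entries such as "a", "an" or "the" are treated as placeholders. The `names` series is made up for illustration.

```python
import pandas as pd

# made-up names, including lowercase placeholders
names = pd.Series(['Charlie', 'a', 'Lucy', 'the', 'Cooper', 'an', 'Lucy'])

# heuristic (an assumption): a real name starts with an uppercase letter
looks_like_name = names.str[0].str.isupper()
clean_names = names[looks_like_name]

print(clean_names.value_counts().head(3))
```

Applied to the master data, the same mask could replace the two explicit comparisons against 'a' and 'the'.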

### Visualization

In [None]:
plt.figure(figsize=(10,6))
sorted_mean.head(10).plot(kind='bar',
                         color='skyblue')
plt.title('Top 10 Breeds by Average Favorite Count')
plt.xlabel('Breed')
plt.ylabel('Average Favorite Count')
plt.xticks(rotation=45)
plt.show()

In [None]:
filtered_df2 = master_df[['breed', 'retweet_count']].copy()
mean=filtered_df2.groupby('breed')['retweet_count'].mean()
sorted_mean= mean.sort_values(ascending=False)



plt.figure(figsize=(10,6))
sorted_mean.head(10).plot(kind='bar',
                         color='red')
plt.title('Top 10 Breeds by Average Retweet Count')
plt.xlabel('Breed')
plt.ylabel('Average Retweet Count')
plt.xticks(rotation=45)
plt.show()

In [None]:
plt.figure(figsize=(8,8))
most_popular_names.plot(kind='pie',
                       autopct='%1.1f%%',
                       startangle=140)
plt.title('Most Popular Dog Names')
plt.ylabel('')

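#share of all dogs in master_df that carry one of the ten most popular names
top_10_share = most_popular_names.sum() / master_df['name'].count() * 100
plt.text(x=0, y=0, s=f'Top 10 names: {top_10_share:.1f}% of dogs', ha='center',
        fontsize=12, color='white')
plt.show()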