# We Rate Dogs Data Analysis 

This notebook analyzes tweets from the WeRateDogs Twitter account. The project practices thorough data wrangling (gathering, assessing, and cleaning). The goal is to wrangle the WeRateDogs Twitter data into interesting and trustworthy analyses and visualizations, and to summarize the findings in a report.

## Table of Contents

1. [Introduction](#introduction)
2. [Data Wrangling](#data-wrangling)
    1. [Gathering data](#gathering-data)
    2. [Assessing data](#assessing-data)
    3. [Cleaning data](#cleaning-data) 
3. [Analysis and Visualization](#analysis-and-visualization)
4. [Reporting](#reporting)

## Introduction <a name="introduction"></a>

To get started, let's import our libraries.

In [26]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import json
import time
import requests
import os.path
import re
%matplotlib inline

pd.set_option('display.max_rows', 2500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 5000) 

## Data Wrangling <a name="data-wrangling"></a>

### Gathering Data <a name="gathering-data"></a>
Read in the first data set: WeRateDogs Twitter archive provided by Udacity.

In [2]:
# read in twitter archive 
twitter_dogs_archive = pd.read_csv('twitter-archive-enhanced-2.csv')
twitter_dogs_archive.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,


Download and read in the image predictions file provided by Udacity.

In [None]:
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
file_name = 'image-predictions.tsv'

# download the file programmatically, but only if it is not on disk yet
if not os.path.exists(file_name):
    response = requests.get(url)
    # stop here if the download failed instead of saving an error page
    response.raise_for_status()

    # write the raw response bytes to the file
    with open(file_name, 'wb') as file_image_predictions:
        file_image_predictions.write(response.content)
        

In [3]:
# load image predictions into data frame
image_predictions = pd.read_csv('image-predictions.tsv', sep='\t')
image_predictions.head(10)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True
5,666050758794694657,https://pbs.twimg.com/media/CT5Jof1WUAEuVxN.jpg,1,Bernese_mountain_dog,0.651137,True,English_springer,0.263788,True,Greater_Swiss_Mountain_dog,0.016199,True
6,666051853826850816,https://pbs.twimg.com/media/CT5KoJ1WoAAJash.jpg,1,box_turtle,0.933012,False,mud_turtle,0.045885,False,terrapin,0.017885,False
7,666055525042405380,https://pbs.twimg.com/media/CT5N9tpXIAAifs1.jpg,1,chow,0.692517,True,Tibetan_mastiff,0.058279,True,fur_coat,0.054449,False
8,666057090499244032,https://pbs.twimg.com/media/CT5PY90WoAAQGLo.jpg,1,shopping_cart,0.962465,False,shopping_basket,0.014594,False,golden_retriever,0.007959,True
9,666058600524156928,https://pbs.twimg.com/media/CT5Qw94XAAA_2dP.jpg,1,miniature_poodle,0.201493,True,komondor,0.192305,True,soft-coated_wheaten_terrier,0.082086,True


#### Twitter API request using Tweepy

In [None]:
# keep login details out of the notebook by reading them from a local JSON file
with open('logins.json') as login_file:
    logins = json.load(login_file)

def get_secret(setting, logins=logins):
    """Return the login setting or raise a KeyError with a helpful message."""
    try:
        return logins[setting]
    except KeyError:
        raise KeyError("Set the {} setting in logins.json.".format(setting))

In [None]:
# retrieve Twitter login details 
consumer_key = get_secret('consumer_key')
consumer_secret = get_secret('consumer_secret')
access_token = get_secret('access_token')
access_secret = get_secret('access_secret')

In [None]:
# access Twitter API
import tweepy

# Redirect to Twitter to authorize
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)

# Get access token
auth.set_access_token(access_token, access_secret)

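# api instance
# note: written against Tweepy 3.x; Tweepy 4 removed wait_on_rate_limit_notify
# and renamed TweepError to TweepyException
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)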


In [None]:
# get tweets from WeRateDogs Twitter timeline
start = time.time()
error_ids = []
tweet_ids = twitter_dogs_archive.tweet_id.values
first_record = True
print("Start requesting WeRateDogs tweets.")
with open('tweet_json.txt', 'w', encoding='utf-8') as file:
    file.write("[\n")
    for index, tweet_id in enumerate(tweet_ids, start=1):
        try:
            # Twitter API request using specific tweet_id
            status = api.get_status(tweet_id, tweet_mode='extended')
            status_json = json.dumps(status._json)

            # write the separator before every record except the first, so a
            # failed request never leaves a trailing comma in the JSON array
            if not first_record:
                file.write(",\n")
            file.write(status_json)
            first_record = False

            # This cell is slow, so print progress to gauge time remaining
            print(index, '-', tweet_id)

        except tweepy.TweepError as e:
            # remember ids that could not be retrieved
            error_ids.append(tweet_id)
            print(tweet_id, '-', e)
    file.write("\n]")
end = time.time()
print("Process finished. Time elapsed:", round((end - start) / 60, 2), "min.")

In [4]:
tweets = []
with open('tweet_json.txt', 'r', encoding='utf-8') as file:
    data = json.load(file)
    # keep only the fields needed for the analysis
    for tweet in data:
        record = {"id": tweet["id"], "retweet_count": tweet["retweet_count"], "favorite_count": tweet["favorite_count"]}
        tweets.append(record)

tweets_df = pd.DataFrame(tweets)
tweets_df.head()


Unnamed: 0,favorite_count,id,retweet_count
0,37683,892420643555336193,8215
1,32373,892177421306343426,6076
2,24378,891815181378084864,4017
3,41004,891689557279858688,8370
4,39208,891327558926688256,9075


In [None]:
# save erroneous ids
errors_df= pd.DataFrame(error_ids)
errors_df.to_csv('errors.csv',index=False)

In [None]:
# load errors
errors_df = pd.read_csv('errors.csv')
errors_df.info()

The project brief for this step: query the Twitter API for each tweet's JSON data using Python's Tweepy library and store each tweet's entire set of JSON data in a file called tweet_json.txt, with each tweet's JSON data on its own line. Then read this file line by line into a pandas DataFrame with (at minimum) tweet ID, retweet count, and favorite count. Note that the cells above store the tweets as a single JSON array rather than one object per line.
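
The one-tweet-per-line layout described in the brief could be sketched as follows. This is a minimal illustration, not the code that produced tweet_json.txt: the records in `sample_tweets` are made up and stand in for `status._json`, and the file name `tweet_json_lines.txt` is a hypothetical choice.

```python
import json
import os
import tempfile

# Made-up records standing in for status._json; real tweets carry many more fields.
sample_tweets = [
    {"id": 1, "retweet_count": 10, "favorite_count": 100, "full_text": "first"},
    {"id": 2, "retweet_count": 20, "favorite_count": 200, "full_text": "second"},
]

# Hypothetical file name inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "tweet_json_lines.txt")

# Write one JSON object per line.
with open(path, "w", encoding="utf-8") as file:
    for tweet in sample_tweets:
        file.write(json.dumps(tweet) + "\n")

# Read the file back line by line, keeping only the fields of interest.
records = []
with open(path, "r", encoding="utf-8") as file:
    for line in file:
        tweet = json.loads(line)
        records.append({
            "id": tweet["id"],
            "retweet_count": tweet["retweet_count"],
            "favorite_count": tweet["favorite_count"],
        })
```

Passing `records` to `pd.DataFrame` would give a frame with the same three columns as `tweets_df`. With this layout every line is valid JSON on its own, so a failed request cannot leave the file in an unreadable state.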

### Assessing Data <a name="assessing-data"></a>

#### Visual assessment

In [None]:
twitter_dogs_archive

In [None]:
tweets_df

In [None]:
image_predictions

#### Programmatic assessment

In [None]:
# Assess twitter dogs enhanced.
twitter_dogs_archive.info()

In [None]:
# show all retweets
retweets = twitter_dogs_archive[twitter_dogs_archive.retweeted_status_id.notna()]['retweeted_status_id'].values.astype(np.int64)
twitter_dogs_archive[twitter_dogs_archive.retweeted_status_id.notna()]

In [None]:
# check if original tweets are in twitter archive
originals_in_archive = np.isin(retweets, twitter_dogs_archive.tweet_id.values)
# keep only the retweets whose original tweet is not part of the archive
retweets_reduced = retweets[~originals_in_archive]
print(originals_in_archive.sum(), "entries out of the retweets are contained in the twitter archive.\n")
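
One caveat about the retweet check: `retweeted_status_id` is stored as float64 (it contains NaN values), and 18-digit tweet ids do not fit exactly into a float64. Casting such a column back to `np.int64` can therefore yield ids that differ from the original ones. The sketch below shows the effect with plain Python numbers, using the first `tweet_id` from the archive preview. Reading the id columns as strings would be one possible workaround, which is an assumption not tested in this notebook.

```python
# first tweet_id shown in the archive preview
tweet_id = 892420643555336193

# a float64 column can only store the nearest representable value
as_float = float(tweet_id)

# converting back to an integer does not recover the original id
round_trip = int(as_float)

ids_match = (tweet_id == round_trip)
difference = abs(tweet_id - round_trip)
print(ids_match, difference)
```

Because of this rounding, membership checks between a float-derived id and the integer `tweet_id` column may miss genuine matches.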

In [None]:
# show all replies
replies = twitter_dogs_archive[twitter_dogs_archive.in_reply_to_status_id.notna()]['tweet_id'].values.astype(np.int64)
twitter_dogs_archive[twitter_dogs_archive.in_reply_to_status_id.notna()]


In [59]:
# assess names
twitter_dogs_archive.name.value_counts()

None              745
a                  55
Charlie            12
Lucy               11
Oliver             11
Cooper             11
Lola               10
Penny              10
Tucker             10
Bo                  9
Winston             9
Sadie               8
the                 8
Bailey              7
Toby                7
an                  7
Buddy               7
Daisy               7
Stanley             6
Milo                6
Leo                 6
Jax                 6
Dave                6
Koda                6
Rusty               6
Jack                6
Scout               6
Bella               6
Oscar               6
very                5
Sammy               5
Bentley             5
Alfie               5
Larry               5
Finn                5
Gus                 5
Sunny               5
George              5
Phil                5
Oakley              5
Chester             5
Louis               5
Duke                4
Jerry               4
Clarence            4
Riley     

In [14]:
# after finding typical mistakes, check whether there is a pattern to recover names
determiners = ["a", "an", "the", "officially", "old", "just", "quite", "getting", "actually", "mad", "not", "by", "very", "one", "this", "life", "all", "None"]

# loop through the name column and print the tweet text whenever the name is one of the determiners
for i, row in twitter_dogs_archive.iterrows():
    if row['name'] in determiners:
        print(i, "-", row['text'])

5 - Here we have a majestic great white breaching off South Africa's coast. Absolutely h*ckin breathtaking. 13/10 (IG: tucker_marlo) #BarkWeek https://t.co/kQ04fDDRmh
7 - When you watch your owner call another dog a good boy but then they turn back to you and say you're a great boy. 13/10 https://t.co/v0nONBcwxq
12 - Here's a puppo that seems to be on the fence about something haha no but seriously someone help her. 13/10 https://t.co/BxvuXk0UCm
24 - You may not have known you needed to see this today. 13/10 please enjoy (IG: emmylouroo) https://t.co/WZqNqygEyV
25 - This... is a Jubilant Antarctic House Bear. We only rate dogs. Please only send dogs. Thank you... 12/10 would suffocate in floof https://t.co/4Ad1jzJSdp
30 - @NonWhiteHat @MayhewMayhem omg hello tanner you are a scary good boy 12/10 would pet with extreme caution
32 - RT @Athletics: 12/10 #BATP https://t.co/WxwJmvjfxo
35 - I have a new hero and his name is Howard. 14/10 https://t.co/gzLHboL7Sk
37 - Here we have a corgi und

893 - No no no this is all wrong. The Walmart had to have run into the dog driving the car. 10/10 someone tell him it's ok
https://t.co/fRaTGcj68A
895 - RT @dog_rates: AT DAWN...
WE RIDE

11/10 https://t.co/QnfO7HEQGA
899 - This doggo is just waiting for someone to be proud of her and her accomplishment. 13/10 legendary af https://t.co/9T2h14yn4Q
902 - Why does this never happen at my front door... 165/150 https://t.co/HmwrdfEfUE
905 - In case you haven't seen the most dramatic sneeze ever... 13/10 https://t.co/iy7ylyZcsE
906 - Teagan reads entire books in store so they're free. Loved 50 Shades of Grey (how dare I make that joke so late) 9/10 https://t.co/l46jwv5WYv
911 - RT @jon_hill987: @dog_rates There is a cunningly disguised pupper here mate! 11/10 at least. https://t.co/7boff8zojZ
912 - Here's another picture without a dog in it. Idk why you guys keep sending these. 4/10 just because that's a neat rug https://t.co/mOmnL19Wsl
913 - She walks herself up and down the train to be pet

1550 - We normally don't rate birds but I feel bad cos this one forgot to fly south for the winter. 9/10 just wants a bath https://t.co/o47yitCn9N
1552 - This pupper just wants to say hello. 11/10 would knock down fence for https://t.co/A8X8fwS78x
1554 - When you have a ton of work to do but then remember you have tomorrow off. 10/10 https://t.co/MfEaMUFYTx
1557 - When you stumble but recover quickly cause your crush is watching. 12/10 https://t.co/PMeq6IedU7
1560 - This pupper is sprouting a flower out of her head. 12/10 revolutionary af https://t.co/glmvQBRjv4
1564 - Please send dogs. I'm tired of seeing other stuff like this dangerous pirate. We only rate dogs. Thank you... 10/10 https://t.co/YdLytdZOqv
1566 - 13/10 I can't stop watching this (vid by @k8lynwright) https://t.co/nZhhMRr5Hp
1568 - With great pupper comes great responsibility. 12/10 https://t.co/hK6xB042EP
1574 - Another magnificent photo. 12/10 https://t.co/X5w387K5jr
1579 - "You got any games on your phone" 7/10 for i

2054 - Striped dog here. Having fun playing on back. Sturdy paws. Looks like an organized Dalmatian. 7/10 would still pet https://t.co/U1mSS3Ykez
2056 - Tfw she says hello from the other side. 9/10 https://t.co/lS1TIDagIb
2060 - This pup holds the secrets of the universe in his left eye. 12/10 https://t.co/F7xwE0wmnu
2062 - Pack of horned dogs here. Very team-oriented bunch. All have weird laughs. Bond between them strong. 8/10 for all https://t.co/U7DQQdZ0mX
2065 - *struggling to breathe properly* 12/10 https://t.co/NKHx0pcOii
2066 - This is a Helvetica Listerine named Rufus. This time Rufus will be ready for the UPS guy. He'll never expect it 9/10 https://t.co/34OhVhMkVr
2067 - Neat pup here. Enjoys lettuce. Long af ears. Short lil legs. Hops surprisingly high for dog. 9/10 still very petable https://t.co/HYR611wiA4
2068 - Me running from commitment. 10/10 https://t.co/ycVJyFFkES
2070 - Two miniature golden retrievers here. Webbed paws. Don't walk very efficiently. Can't catch a tenn

In [None]:
twitter_dogs_archive[twitter_dogs_archive.tweet_id == 679062614270468096]

In [None]:
twitter_dogs_archive.doggo.value_counts()

In [None]:
twitter_dogs_archive.floofer.value_counts()

In [None]:
twitter_dogs_archive.pupper.value_counts()

In [None]:
twitter_dogs_archive.puppo.value_counts()

In [None]:
# assess texts
twitter_dogs_archive.text

In [None]:
print("Number of duplicated tweet ids:", len(twitter_dogs_archive[twitter_dogs_archive.tweet_id.duplicated(keep=False)]))
twitter_dogs_archive[twitter_dogs_archive.tweet_id.duplicated()]

In [None]:
# Tweets containing "We only rate dogs" caught my attention; however, the phrase seems to be a joke about pictures that don't show dogs.
# Print every row that contains "We only rate dogs" or "We. Only. Rate. Dogs."
pattern = re.compile(r'we.? only.? rate.? dogs', re.IGNORECASE)
# dogs_clean does not exist yet at this point, so search the raw archive
for i, row in twitter_dogs_archive.iterrows():
    if pattern.search(row['text']):
        print(row['text'])
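
To see what the pattern does and does not match, it can be tried on a few strings first. A small sketch: the first and third sample are taken from tweet texts shown in this notebook, the second is the punctuation variant named in the comment above.

```python
import re

pattern = re.compile(r'we.? only.? rate.? dogs', re.IGNORECASE)

samples = [
    "This... is a Jubilant Antarctic House Bear. We only rate dogs. Please only send dogs.",
    "We. Only. Rate. Dogs.",
    "This is Phineas. He's a mystical boy.",
]

# True where the phrase occurs; `.?` allows one optional character after each word
matches = [bool(pattern.search(text)) for text in samples]
print(matches)
```

The optional `.?` after each word is what lets the same expression cover both the plain phrase and the version with a full stop after every word.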

##### Assessing tweets data

In [None]:
# looking at errors
print("Number of non-existing tweet ids: ", len(errors_df))

In [None]:
tweets_df[tweets_df['retweet_count'].isnull()]

##### Assessing image predictions data

In [None]:
image_predictions.info()

In [None]:
image_predictions.tweet_id.duplicated().sum()

In [None]:
image_predictions.iloc[2059]['jpg_url']

In [None]:
# check if retweets are in image predictions -> no retweets in image predictions
for retweet in retweets_reduced:
    if retweet in image_predictions.tweet_id.values:
        print(retweet)

**Quality**  

_**twitter archive table**_
- Irrelevant retweets
- Irrelevant columns in_reply_to_status_id, in_reply_to_user_id
- Incorrect dog names, determiners like: a, an, the, officially, old, just, quite, getting, actually, mad, not, by, very, one, this, life, all, None
- Incorrect null values in dog stages. None should be NaN.
- Erroneous data types (timestamp, source, dog_stage)
- Source value is wrapped in an anchor tag
- Text contains mentions of users, e.g. @NonWhiteHat
- Contains tweets that are replies to other tweets --> remove if in_reply_to_status_id/in_reply_to_user_id is not NaN
- Some tweets contain a link using t.co, Twitter's URL shortener. These links are not active anymore. The working URL is included in expanded_urls
- Rows 47, 59, 62, 91 are not valid observations ("We only rate dogs" comments)
- Incorrectly extracted names or None as names, e.g. "a" in row 56; None should be NaN
- Incorrect denominators (not 10)
- Incomparable rating numerators
- Tweets with missing photos
- Incorrect dog stages
- Dogs with multiple dog stages in the following tweet_ids: 854010172552949760, 817777686764523521, 808106460588765185, 802265048156610565, 801115127852503040, 785639753186217984, 781308096455073793, 775898661951791106, 770093767776997377, 759793422261743616, 751583847268179968, 741067306818797568, 733109485275860992, 855851453814013952

_**image predictions table**_
- Dog breeds in p1, p2, p3 contain underscores; some are capitalized, some are lowercase
- Breed predictions contain non-dog labels, e.g. hen, cock, personal_computer, shopping_cart, box_turtle... --> p_dog is False in this case
- Contains 281 fewer entries compared to the twitter archive table

_**tweets table**_ 
- Contains 17 fewer ids compared to the twitter archive table due to errors during the Twitter API requests

**Tidiness**  

_**twitter archive table**_
- doggo, floofer, pupper, puppo should be one column
- 3 separate tables of the same purpose
- Multiple urls in expanded_urls (some are duplicates)


### Cleaning Data <a name="cleaning-data"></a>
#### Tidiness


In [38]:
# Create copies of data frames (.copy() keeps the originals untouched)
dogs_clean = twitter_dogs_archive.copy()
tweets_clean = tweets_df.copy()
images_clean = image_predictions.copy()

**dogs: _doggo, floofer, pupper, puppo should be one column_**

_**Define**_

Create a dog_stage column that assigns a dog stage to each tweet, based on the four separate columns doggo, floofer, pupper and puppo. Tweets where all four columns contain "None" keep "None" as their dog stage for now. Tweets with several stages end up with one row per stage.

_**Code**_

In [40]:
# create a dog_stage column from the four stage columns
column_names = ['tweet_id','in_reply_to_status_id','in_reply_to_user_id', \
                'timestamp','source','text','retweeted_status_id', 'retweeted_status_user_id','retweeted_status_timestamp',\
                'expanded_urls','rating_numerator','rating_denominator','name']
stage_columns = ['doggo', 'floofer', 'pupper', 'puppo']

# helper column: "None" marks tweets without any dog stage so they survive the filter below
dogs_clean['None'] = "placeholder"
no_stage = (dogs_clean[stage_columns] == 'None').all(axis=1)
dogs_clean.loc[no_stage, 'None'] = "None"

# melt dog stages into rows
dogs_clean = pd.melt(dogs_clean, id_vars=column_names, var_name='placeholder', value_name='dog_stage')

# keep only rows where the melted column name equals its value (e.g. doggo == doggo)
dogs_clean = dogs_clean[dogs_clean['placeholder'] == dogs_clean['dog_stage']]

dogs_clean = dogs_clean.drop(['placeholder'], axis=1).reset_index(drop=True)
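
The melt above yields one row per dog stage, so a tweet with two stages appears twice. An alternative design, shown here only as a sketch on a made-up frame, keeps one row per tweet and joins multiple stages into a single string. The helper `combine_stages` and the frame `toy` are hypothetical names that do not appear elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

stage_columns = ['doggo', 'floofer', 'pupper', 'puppo']

# made-up rows covering the cases: one stage, no stage, two stages
toy = pd.DataFrame({
    'tweet_id': [1, 2, 3, 4],
    'doggo':   ['doggo', 'None', 'doggo', 'None'],
    'floofer': ['None', 'None', 'None', 'None'],
    'pupper':  ['None', 'None', 'pupper', 'None'],
    'puppo':   ['None', 'None', 'None', 'puppo'],
})

def combine_stages(row):
    """Join all stages found in a row, or return NaN if there are none."""
    stages = [value for value in row if value != 'None']
    return ', '.join(stages) if stages else np.nan

# build the single dog_stage column and drop the four original columns
toy['dog_stage'] = toy[stage_columns].apply(combine_stages, axis=1)
toy = toy.drop(columns=stage_columns)
print(toy)
```

Which layout fits better depends on the analysis: one row per tweet avoids counting retweets and favorites twice, while one row per stage makes grouping by stage simpler.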


_**Test**_

In [41]:
# Data frame must have a min of 2356 non-null values plus 14 tweets with multiple values for dog stages
dogs_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2370 entries, 0 to 2369
Data columns (total 14 columns):
tweet_id                      2370 non-null int64
in_reply_to_status_id         79 non-null float64
in_reply_to_user_id           79 non-null float64
timestamp                     2370 non-null object
source                        2370 non-null object
text                          2370 non-null object
retweeted_status_id           183 non-null float64
retweeted_status_user_id      183 non-null float64
retweeted_status_timestamp    183 non-null object
expanded_urls                 2311 non-null object
rating_numerator              2370 non-null int64
rating_denominator            2370 non-null int64
name                          2370 non-null object
dog_stage                     2370 non-null object
dtypes: float64(4), int64(3), object(7)
memory usage: 259.3+ KB


In [44]:
# no changes in value counts of dog stages
dogs_clean.dog_stage.value_counts()

None       1976
pupper      257
doggo        97
puppo        30
floofer      10
Name: dog_stage, dtype: int64

_**3 separate tables of the same purpose**_

_**Define**_

Join the dogs table and the tweets table on 'tweet_id' and 'id' respectively, removing non-matching tweet_ids. Then join the new dogs table and the image predictions table on their common tweet_id. Keep all entries without a matching dog breed prediction so that no entries are lost.

_**Code**_

In [45]:
# merge dogs and tweets tables
dogs_clean = pd.merge(dogs_clean, tweets_clean, how='inner', left_on='tweet_id', right_on='id').drop(['id'], axis=1)

# merge dogs and image predictions tables
dogs_clean = pd.merge(dogs_clean, images_clean, how='left', on='tweet_id')
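
The difference between the inner and the left join can be seen on a small example. The three frames below are made up for illustration (the names, counts and labels are not real rows). The `validate='many_to_one'` argument is an optional extra, not used in the cell above, that makes pandas raise an error if the right-hand table contains duplicate keys.

```python
import pandas as pd

# made-up stand-ins for the three tables
dogs = pd.DataFrame({'tweet_id': [1, 2, 3], 'name': ['Phineas', 'Tilly', 'Archie']})
tweets = pd.DataFrame({'id': [1, 2], 'retweet_count': [8215, 6076]})
images = pd.DataFrame({'tweet_id': [1, 3], 'p1': ['redbone', 'chow']})

# inner join: tweet 3 has no API record and is dropped
merged = pd.merge(dogs, tweets, how='inner', left_on='tweet_id', right_on='id',
                  validate='many_to_one').drop(columns='id')

# left join: tweet 2 has no prediction and is kept with NaN in p1
merged = pd.merge(merged, images, how='left', on='tweet_id', validate='many_to_one')
print(merged)
```

The result has two rows: tweet 1 with all values filled in, and tweet 2 with a missing `p1`.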

_**Test**_

In [46]:
# should have all 2353 entries after dropping the missing tweets
# all columns are combined in one table 
dogs_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2353 entries, 0 to 2352
Data columns (total 27 columns):
tweet_id                      2353 non-null int64
in_reply_to_status_id         79 non-null float64
in_reply_to_user_id           79 non-null float64
timestamp                     2353 non-null object
source                        2353 non-null object
text                          2353 non-null object
retweeted_status_id           169 non-null float64
retweeted_status_user_id      169 non-null float64
retweeted_status_timestamp    169 non-null object
expanded_urls                 2294 non-null object
rating_numerator              2353 non-null int64
rating_denominator            2353 non-null int64
name                          2353 non-null object
dog_stage                     2353 non-null object
favorite_count                2353 non-null int64
retweet_count                 2353 non-null int64
jpg_url                       2079 non-null object
img_num                       2079

_**Quality**_

_**irrelevant retweets**_

_**Define**_  
Keep only rows containing a null value in 'retweeted_status_id'. Subsequently, remove the columns that are no longer needed: 'retweeted_status_id', 'retweeted_status_user_id' and 'retweeted_status_timestamp'.

_**Code**_

In [47]:
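dogs_clean = dogs_clean[dogs_clean.retweeted_status_id.isna()]
# reassign instead of dropping in place on a filtered frame, which can trigger a SettingWithCopyWarning
dogs_clean = dogs_clean.drop(['retweeted_status_id', 'retweeted_status_user_id', 'retweeted_status_timestamp'], axis=1)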

_**Test**_

In [48]:
# After removing 169 retweet observations, we should have 2184 observations left.  
dogs_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2184 entries, 0 to 2352
Data columns (total 24 columns):
tweet_id                 2184 non-null int64
in_reply_to_status_id    79 non-null float64
in_reply_to_user_id      79 non-null float64
timestamp                2184 non-null object
source                   2184 non-null object
text                     2184 non-null object
expanded_urls            2126 non-null object
rating_numerator         2184 non-null int64
rating_denominator       2184 non-null int64
name                     2184 non-null object
dog_stage                2184 non-null object
favorite_count           2184 non-null int64
retweet_count            2184 non-null int64
jpg_url                  2002 non-null object
img_num                  2002 non-null float64
p1                       2002 non-null object
p1_conf                  2002 non-null float64
p1_dog                   2002 non-null object
p2                       2002 non-null object
p2_conf                 

_**Incorrect dog names, determiners like: a, an, the, officially, old, just, quite, getting, actually, mad, not, by, very, one, this, life, all, None**_

_**Define**_  
Scan through all rows whose name is missing or wrongly extracted, i.e. a, an, the, officially, old, just, quite, getting, actually, mad, not, by, very, one, this, life, all or None. Look for a dog name in the text column following the pattern "named (something)". If the pattern matches, replace the name with the match. If there is no match, fill the cell with a null value.

_**Code**_

In [49]:
clean = dogs_clean

1497

In [53]:
# `determiners` (defined during assessment) lists the words wrongly extracted as names, plus "None"
pattern = re.compile(r'named ([A-Za-z]+)')

def extract_name(row):
    """Return the word following "named" in the tweet text, or NaN if there is none."""
    match = pattern.search(row['text'])
    return match.group(1) if match else np.nan

# loop through the name column and replace the name whenever it equals a determiner
for i, row in clean.iterrows():
    if row['name'] in determiners:
        clean.at[i, 'name'] = extract_name(row)
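
The extraction rule can be checked in isolation on single tweet texts. The sketch below is a slightly stricter variant than the cell above: it assumes that a name starts with a capital letter, and it returns `None` instead of NaN to stay free of NumPy. Both sample texts are shortened versions of tweets printed in the assessment section.

```python
import re

# stricter variant (assumption): a name is a capitalized word after "named"
name_pattern = re.compile(r'named ([A-Z][a-z]+)')

def extract_name_from_text(text):
    """Return the capitalized word following "named", or None if there is none."""
    match = name_pattern.search(text)
    return match.group(1) if match else None

with_name = extract_name_from_text("This is a Helvetica Listerine named Rufus. This time Rufus will be ready for the UPS guy.")
without_name = extract_name_from_text("Here we have a majestic great white breaching off South Africa's coast.")
print(with_name, without_name)
```

Using a capture group avoids slicing the match by a fixed offset, which would break if the literal prefix in the pattern ever changed.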
        

_**Test**_

In [60]:
# all determiners are gone
clean.name.value_counts()

Lucy              11
Charlie           10
Oliver            10
Cooper            10
Tucker             9
Penny              9
Lola               8
Winston            8
Sadie              8
Toby               7
Daisy              7
Oscar              6
Bella              6
Bo                 6
Jax                6
Bailey             6
Stanley            6
Koda               6
Maggie             5
Dave               5
Scout              5
Milo               5
Chester            5
Rusty              5
Bentley            5
Louis              5
Leo                5
Buddy              5
Alfie              4
Scooter            4
Cassie             4
Sophie             4
Brody              4
Gus                4
Archie             4
Clarence           4
Clark              4
Jerry              4
Bear               4
Jack               4
Jeffrey            4
Oakley             4
Larry              4
Reggie             4
George             4
Boomer             4
Dexter             4
Finn         

In [55]:
# names were replaced by a name in their respective text column
# for i, row in clean.iterrows():
#     if row['name'] in determiners:
#         new_name = extract_name(row)
#         clean.at[i, 'name'] = new_name

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,expanded_urls,rating_numerator,rating_denominator,name,dog_stage,favorite_count,retweet_count,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,890240255349198849,,,2017-07-26 15:59:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Cassie. She is a college pup. Studying...,https://twitter.com/dog_rates/status/890240255...,14,10,Cassie,doggo,31067,7134,https://pbs.twimg.com/media/DFrEyVuW0AAO3t9.jpg,1.0,Pembroke,0.511319,True,Cardigan,0.451038,True,Chihuahua,0.029248,True
1,884162670584377345,,,2017-07-09 21:29:42 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Meet Yogi. He doesn't have any important dog m...,https://twitter.com/dog_rates/status/884162670...,12,10,Yogi,doggo,19828,2887,https://pbs.twimg.com/media/DEUtQbzW0AUTv_o.jpg,1.0,German_shepherd,0.707046,True,malinois,0.199396,True,Norwegian_elkhound,0.049148,True
2,872967104147763200,,,2017-06-09 00:02:31 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here's a very large dog. He has a date later. ...,https://twitter.com/dog_rates/status/872967104...,12,10,,doggo,26726,5288,https://pbs.twimg.com/media/DB1m871XUAAw5vZ.jpg,2.0,Labrador_retriever,0.476913,True,Chesapeake_Bay_retriever,0.174145,True,German_short-haired_pointer,0.092861,True
3,871515927908634625,,,2017-06-04 23:56:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Napolean. He's a Raggedy East Nicaragu...,https://twitter.com/dog_rates/status/871515927...,12,10,Napolean,doggo,19794,3395,https://pbs.twimg.com/media/DBg_HT9WAAEeIMM.jpg,2.0,komondor,0.974781,True,briard,0.020041,True,swab,0.003228,False
4,871102520638267392,,,2017-06-03 20:33:19 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Never doubt a doggo 14/10 https://t.co/AbBLh2FZCH,https://twitter.com/animalcog/status/871075758...,14,10,,doggo,20536,5412,,,,,,,,,,,


_**Incorrect null values in dog stages. None should be NaN.**_

_**Define**_  
Replace every "None" string with a NumPy null value (np.nan) in the name and dog_stage columns.

_**Code**_

In [None]:
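# cover both columns named in the definition; assign the result instead of replacing in place
dogs_clean[['name', 'dog_stage']] = dogs_clean[['name', 'dog_stage']].replace("None", np.nan)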

_**Test**_

In [None]:
# dog stages should have 2184-1828=356 entries, since there were 1828 Nones
# names should have 2184-687=1497 entries, since there were 687 Nones
dogs_clean.info()


_**Irrelevant columns in_reply_to_status_id, in_reply_to_user_id**_

_**Define**_  
Drop the in_reply_to_status_id and in_reply_to_user_id columns from the data set.

_**Code**_

In [None]:
dogs_clean.drop(['in_reply_to_status_id', 'in_reply_to_user_id'], axis=1, inplace=True)

_**Test**_

In [None]:
# the in_reply_to_status_id and in_reply_to_user_id columns were removed
dogs_clean.info()

_**Erroneous data types (timestamp, source, dog stage)**_

_**Define**_  

Convert timestamp into datetime format. Convert source and dog stage into categorical data. 

_**Code**_

In [None]:
# To datetime
dogs_clean.timestamp = pd.to_datetime(dogs_clean.timestamp)

# To category
dogs_clean.source = dogs_clean.source.astype('category')
dogs_clean.dog_stage = dogs_clean.dog_stage.astype('category')
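
The conversions can be tried on a small frame first. The sketch below also strips the anchor tag around the source value, which is listed as a quality issue above but not cleaned in this notebook. The source string is a hypothetical example, since the real values are truncated in the preview, and the regular expression assumes that the label sits directly before the closing `</a>`.

```python
import pandas as pd

toy = pd.DataFrame({
    'timestamp': ['2017-08-01 16:23:56 +0000', '2017-08-01 00:17:27 +0000'],
    # hypothetical anchor tags; the real source values are truncated in the preview
    'source': ['<a href="http://example.com/app" rel="nofollow">Example client</a>'] * 2,
})

# parse the timestamp strings into timezone-aware datetimes
toy['timestamp'] = pd.to_datetime(toy['timestamp'])

# keep only the text between the opening and the closing anchor tag, then convert to category
toy['source'] = toy['source'].str.extract(r'>([^<]+)</a>', expand=False).astype('category')
print(toy.dtypes)
```

Stripping the tag before the category conversion keeps the category labels short and readable in plots.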

_**Test**_

In [None]:
dogs_clean.info()

## Analysis and Visualization <a name="analysis-and-visualization"></a>

- Most popular dog names
- Most popular dog content
- Rating statistics
- Popularity of the account
- Where are users from?
- Most popular hashtags
- Which breed is associated with which dog stage?
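
Two of the planned questions, the most popular names and simple statistics per dog stage, could be approached as sketched below. The frame `toy` is made up and only mirrors the column names of the cleaned table; the numbers are not results from the real data.

```python
import pandas as pd

# made-up rows using the column names of the cleaned table
toy = pd.DataFrame({
    'name': ['Lucy', 'Charlie', 'Lucy', None],
    'dog_stage': ['pupper', 'doggo', 'pupper', None],
    'rating_numerator': [12, 13, 11, 10],
    'retweet_count': [100, 300, 200, 50],
})

# most frequent names; missing names are ignored by value_counts
top_names = toy['name'].value_counts().head(10)

# mean rating and mean retweet count per dog stage; rows without a stage are left out
stage_summary = toy.groupby('dog_stage')[['rating_numerator', 'retweet_count']].mean()
print(top_names)
print(stage_summary)
```

On the real table, the same two calls would run on `dogs_clean` once the cleaning steps above are complete.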

## Reporting <a name="reporting"></a>