# Wrangle and Analyze WeRateDogs Data

"The dataset that you will be wrangling (and analyzing and visualizing) is the tweet archive of Twitter user @dog_rates, also known as WeRateDogs. WeRateDogs is a Twitter account that rates people's dogs with a humorous comment about the dog. These ratings almost always have a denominator of 10. The numerators, though? Almost always greater than 10. 11/10, 12/10, 13/10, etc. Why? Because "they're good dogs Brent." WeRateDogs has over 4 million followers and has received international media coverage."

### Table of Contents

<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
    <ul>
    <li><a href="#gathering">Gathering Data</a></li>
    <li><a href="#assessing">Assessing Data</a></li>
        <ul>
          <li><a href="#issues">Identified Issues</a></li>
        </ul>
    <li><a href="#cleaning">Cleaning Data</a></li>
    </ul>
<li><a href="#sav">Store, Analyze and Visualize</a></li>
<li><a href="#conclusion">Conclusion</a></li>
</ul>

<a id='intro'></a>
## Introduction


#### Import

In [1]:
import pandas as pd
import numpy as np
import requests
import os
import tweepy
import json
import matplotlib.pyplot as plt
%matplotlib inline

# https://stackoverflow.com/questions/11707586/how-do-i-expand-the-output-display-to-see-more-columns
# To see all rows in datasets in order to help visual assessment
# pd.set_option('display.max_rows', 5000)

<a id='wrangling'></a>
## Data Wrangling

In this section of the report, we will load in the data, check it for cleanliness, and then trim and clean the datasets for analysis. 

<a id='gathering'></a>
### Gathering Data

#### Enhanced WeRateDogs Dataset 

In [4]:
# Read in the dataset
enhanced_df = pd.read_csv('twitter-archive-enhanced.csv')

#### Dog Breed Predictions

In [5]:
# Create folder if it doesn't already exist
folder_name = 'tweet_image_predictions'
if not os.path.exists(folder_name):
    os.makedirs(folder_name)

In [6]:
# Load dataset and check response
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
response = requests.get(url)
response.status_code

200

In [7]:
# Add data to folder
with open(os.path.join(folder_name, url.split('/')[-1]), mode='wb') as file:
    file.write(response.content)
    
os.listdir(folder_name)

['image-predictions.tsv']

In [8]:
# Read the tab-separated predictions file
predictions_df = pd.read_csv(os.path.join(folder_name, 'image-predictions.tsv'), sep='\t')

#### Accessing Twitter API

In [9]:
# # API Keys, Secrets, and Tokens
# consumer_key = ''
# consumer_secret = ''
# access_token = ''
# access_secret = ''

In [10]:
# # Redirect to Twitter and get access token
# auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
# auth.set_access_token(access_token, access_secret)

# # API instance
# # guidance for this was found here: https://stackoverflow.com/questions/28384588/twitter-api-get-tweets-with-specific-id
# api = tweepy.API(auth_handler=auth, 
#                  wait_on_rate_limit=True, 
#                  wait_on_rate_limit_notify=True)


In [11]:
# # Twitter API

# # Split into working tweets list and list of tweets with errors
# tweets = []
# errors = []

# tweet_ids = list(enhanced_df['tweet_id'])

# with open('tweet_json.txt', 'w') as file:
#     for tweet_id in tweet_ids:
#         try:
#             # Get extended tweet information via the id
#             extended_tweet = api.get_status(tweet_id, tweet_mode='extended')
#             json.dump(extended_tweet._json, file)
#             file.write('\n')
            
#             # Add to working tweets list
#             tweets.append(tweet_id)
#             print(tweet_id)
        
#         # Support used to better understand TweepError: https://www.programcreek.com/python/example/13279/tweepy.TweepError
#         except tweepy.TweepError as e:
            
#             # Add to list of tweets with errors
#             errors.append(tweet_id)
#             print(tweet_id, e)

Note: do not include your Twitter API keys, secrets, and tokens in your project submission.

In [12]:
# Read in JSON 
tweet_df = pd.read_json('tweet_json.txt', lines = True, encoding = 'utf-8')
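
The API file holds one JSON object per line, which is what `lines=True` tells pandas to expect. As a rough sketch of that layout, and of how to keep only the fields an analysis needs, the snippet below parses two invented records with the standard library. The field names `id`, `retweet_count`, and `favorite_count` are assumed to follow the Twitter API tweet object; the values are made up for illustration.

```python
import io
import json

# Two invented records standing in for lines of tweet_json.txt
sample = io.StringIO(
    '{"id": 1, "retweet_count": 10, "favorite_count": 25, "lang": "en"}\n'
    '{"id": 2, "retweet_count": 3, "favorite_count": 7, "lang": "en"}\n'
)

# Fields to keep (an assumption about what the later analysis needs)
wanted = ("id", "retweet_count", "favorite_count")

records = []
for line in sample:
    line = line.strip()
    if not line:
        continue  # skip blank lines
    tweet = json.loads(line)
    records.append({key: tweet.get(key) for key in wanted})

print(records)
```

Reading the whole file with `pd.read_json(..., lines=True)` and selecting columns afterwards gives the same information; the loop only makes the file format explicit.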

<a id='assessing'></a>
### Assessing Data

In this section, we will visually and programmatically assess the 3 datasets to determine whether or not they hold any quality or tidiness issues.

#### Enhanced twitter dataframe assessment

In [13]:
# Overview of all the data
enhanced_df

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,
5,891087950875897856,,,2017-07-29 00:08:17 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here we have a majestic great white breaching ...,,,,https://twitter.com/dog_rates/status/891087950...,13,10,,,,,
6,890971913173991426,,,2017-07-28 16:27:12 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Meet Jax. He enjoys ice cream so much he gets ...,,,,"https://gofundme.com/ydvmve-surgery-for-jax,ht...",13,10,Jax,,,,
7,890729181411237888,,,2017-07-28 00:22:40 +0000,"<a href=""http://twitter.com/download/iphone"" r...",When you watch your owner call another dog a g...,,,,https://twitter.com/dog_rates/status/890729181...,13,10,,,,,
8,890609185150312448,,,2017-07-27 16:25:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Zoey. She doesn't want to be one of th...,,,,https://twitter.com/dog_rates/status/890609185...,13,10,Zoey,,,,
9,890240255349198849,,,2017-07-26 15:59:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Cassie. She is a college pup. Studying...,,,,https://twitter.com/dog_rates/status/890240255...,14,10,Cassie,doggo,,,


In [None]:
# See amount of data within each column, how many rows exist, and the datatypes
enhanced_df.info()

In [None]:
# Inspect one full tweet text, since the table view truncates it
enhanced_df.text[400]

In [None]:
# See how many images are missing
enhanced_df.expanded_urls.isnull().sum()

In [None]:
# See how many images are duplicated
enhanced_df.expanded_urls.duplicated().sum()

In [None]:
# Find missing data
enhanced_df.isnull().sum()

In [None]:
# Find number of retweets
enhanced_df.retweeted_status_id.notna().sum()

In [None]:
# Look at dog names to ensure they are real names
enhanced_df.name.value_counts()
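
Scanning `value_counts()` by eye works, but words such as 'a' or 'the' that were picked up as names are all lowercase, so a lowercase check can surface candidates programmatically. A minimal sketch on invented values (the Series `names` is hypothetical and stands in for `enhanced_df.name`):

```python
import pandas as pd

# Invented sample standing in for enhanced_df.name
names = pd.Series(["Phineas", "a", "Tilly", "the", "None", "an", "a"])

# Real names are capitalized; words extracted by mistake are lowercase
suspicious = names[names.str.islower()]
candidates = sorted(suspicious.unique())

print(candidates)
```

The string 'None' is capitalized, so this check does not catch it and it needs to be handled separately.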

In [None]:
# How many doggos are there
enhanced_df.doggo.value_counts()

In [None]:
# How many floofers are there
enhanced_df.floofer.value_counts()

In [None]:
# How many puppers are there
enhanced_df.pupper.value_counts()

In [None]:
# How many puppos are there
enhanced_df.puppo.value_counts()

#### Predicted dog type assessment

In [None]:
# Overview of all the data
predictions_df

In [None]:
# See amount of data within each column, how many rows exist, and the datatypes
predictions_df.info()

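In [None]:
# Count how many top predictions are entirely lowercase
predictions_df.p1.str.islower().sum()

In [None]:
# Check for duplicated image URLs
predictions_df.jpg_url.duplicated().sum()

In [None]:
# Check for duplicated tweet IDs
predictions_df.tweet_id.duplicated().sum()

In [None]:
# See which predicted labels are most common
predictions_df.p1.value_counts()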

#### Twitter API data assessment

In [None]:
# Overview of all the data
tweet_df

In [None]:
# See amount of data within each column, how many rows exist, and the datatypes
tweet_df.info()

In [None]:
# Check for missing values
tweet_df.isnull().sum()

In [None]:
# Check for duplicated tweet IDs
tweet_df.id.duplicated().sum()

<a id='issues'></a>
### Identified Issues

#### Quality Issues

##### `enhanced_df` table:
- Retweet information is not needed, so items that are not NaN in `retweeted_status_id` can be removed
- Irrelevant columns since we only want to look at original ratings (`in_reply_to_status_id`, `in_reply_to_user_id`, `retweeted_status_id`, `retweeted_status_user_id`, `retweeted_status_timestamp`)
- Missing photo URLs for tweets (`expanded_urls`)
- Erroneous datatype (`timestamp` and `retweeted_status_timestamp` are strings) 
- Erroneous dog names ('a', 'an', 'the', 'his', 'such', 'just', 'very', 'one', 'not', 'mad', 'this', 'all','my', 'light', 'by', 'old', 'space', 'officially' and None are not real names)
- Duplicated rows after melting

##### `predictions_df` table:
- About half of `p1`, `p2`, and `p3` are not capitalized
- Duplicated images/rows
- Image predictions of things that are not dogs

##### `tweet_df` table:
- Missing data for entire column (`contributors`,`coordinates`,`geo`)
- Missing data for almost entire column (`in_reply_to_screen_name`, `in_reply_to_status_id`, `in_reply_to_status_id_str`, `in_reply_to_user_id`, `in_reply_to_user_id_str`, `place`, `quoted_status`, `quoted_status_id`, `quoted_status_id_str`, `quoted_status_permalink`, `retweeted_status`)

#### Tidiness Issues
- Single variable `dog_stage` split up into four columns in `enhanced_df` table
- Column name not clear (`id` should be `tweet_id` to match the other tables) in `tweet_df` table
- All tables can be combined into one on `tweet_id`

<a id='cleaning'></a>
### Cleaning Data

In this section, we will target the quality and tidiness issues that were identified in the previous section. For each issue, we will define it in more depth, code the solution, and test that the fix worked.

Before we begin, we should create copies of our dataframes in order to not alter the originals.

In [14]:
# Create copies
enhanced_df_clean = enhanced_df.copy()
predictions_df_clean = predictions_df.copy()
tweet_df_clean = tweet_df.copy()

In [15]:
enhanced_df_clean.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,


#### Retweet information is not needed, so items that are not NaN in `retweeted_status_id` can be removed
##### Define
Keep only the rows where `retweeted_status_id` is NaN, since those are original tweets rather than retweets.

##### Code

In [33]:
# Keep only tweets that are not retweets (retweeted_status_id is missing)
enhanced_df_clean = enhanced_df_clean[enhanced_df_clean.retweeted_status_id.isna()].copy()

##### Test

In [19]:
# Check to see if there are any retweets left
enhanced_df_clean.retweeted_status_id.notnull().sum()

0

In [32]:
# Every remaining row should report a missing retweeted_status_id
enhanced_df_clean.retweeted_status_id.isna().head()

0    True
1    True
2    True
3    True
4    True
Name: retweeted_status_id, dtype: bool

#### Irrelevant columns since we only want to look at original ratings (`in_reply_to_status_id`, `in_reply_to_user_id`, `retweeted_status_id`, `retweeted_status_user_id`, `retweeted_status_timestamp`)
##### Define
Now that the retweeted dog ratings are removed, we can drop the retweet columns along with the reply columns, since only original ratings are of interest.

##### Code

In [None]:
# Drop all irrelevant columns
enhanced_df_clean.drop(columns=[
    'in_reply_to_status_id', 
    'in_reply_to_user_id', 
    'retweeted_status_id', 
    'retweeted_status_user_id', 
    'retweeted_status_timestamp'], axis=1, inplace=True)

##### Test

In [None]:
# Check the columns remaining
list(enhanced_df_clean)

#### Single variable `dog_stage` split up into four columns in `enhanced_df` table
##### Define
To put all the dog stages into one column, first we will split the `enhanced_df_clean` table into a df with the dog stages and into a df without the dog stages. For the dataframe without dog stages, we will drop all the dog stage columns. For the dataframe with dog stages, we will melt the columns into one with the results. Then we will concatenate the two cleaned dataframes to create the new `enhanced_df_clean`.
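
The split, melt, and concatenate plan described above can be sketched on a toy table first. This is only an illustration under simplifying assumptions: the frame `toy` is invented, it has two stage columns instead of four, and the helper column name `stage_col` is hypothetical.

```python
import pandas as pd

# Invented toy data: the string "None" marks a missing stage, as in the archive
toy = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "doggo": ["doggo", "None", "None"],
    "pupper": ["None", "pupper", "None"],
})

stage_cols = ["doggo", "pupper"]
has_stage = (toy[stage_cols] != "None").any(axis=1)

# Rows without any stage: simply drop the stage columns
without_stage = toy[~has_stage].drop(columns=stage_cols)

# Rows with a stage: melt into one column, then discard the "None" leftovers
with_stage = toy[has_stage].melt(
    id_vars="tweet_id", var_name="stage_col", value_name="dog_type"
)
with_stage = with_stage[with_stage.dog_type != "None"].drop(columns="stage_col")

# Stack both parts back together; rows without a stage get NaN in dog_type
result = (
    pd.concat([with_stage, without_stage], sort=False)
    .sort_values("tweet_id")
    .reset_index(drop=True)
)

print(result)
```

A tweet that carries two stages would produce two rows after the melt, which is why a duplicate check on `tweet_id` is worth running afterwards.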

##### Code

In [None]:
# Create dataframe without any dog stages
without_stage = enhanced_df_clean[(enhanced_df_clean.doggo == 'None') 
                                  & (enhanced_df_clean.floofer == 'None') 
                                  & (enhanced_df_clean.pupper == 'None') 
                                  & (enhanced_df_clean.puppo == 'None')]

##### Test

In [None]:
# Check to see that none of the rows have any dog stages (expect an empty dataframe)
without_stage[(without_stage.doggo != 'None') 
                | (without_stage.floofer != 'None') 
                | (without_stage.pupper != 'None') 
                | (without_stage.puppo != 'None')]

In [None]:
# Check to see the total rows for dogs without a stage
without_stage.info()

##### Code

In [None]:
# Drop the dog stage columns, now that we have confirmed they hold no stage for these rows
# (reassign rather than drop in place, since without_stage is a slice of enhanced_df_clean)
without_stage = without_stage.drop(columns=['doggo', 'floofer', 'pupper', 'puppo'])

##### Test

In [None]:
# Check to see if the columns were correctly dropped
without_stage.head(5)

In [None]:
enhanced_df_clean.head()

##### Code

In [None]:
# Create a dataframe for all dogs with a stage
with_stage = enhanced_df_clean[(enhanced_df_clean.doggo != 'None') 
                                  | (enhanced_df_clean.floofer != 'None') 
                                  | (enhanced_df_clean.pupper != 'None') 
                                  | (enhanced_df_clean.puppo != 'None')]

##### Test

In [None]:
# Check samples to ensure that each dog has a stage
with_stage.sample()

In [None]:
# Melt the dog stage columns of doggo, floofer, pupper, and puppo into a single column
with_stage = pd.melt(with_stage, id_vars=['tweet_id', 'timestamp', 'source', 'text', 'expanded_urls', 'rating_numerator', 'rating_denominator', 'name'],
                           var_name='dog_stage', value_name='dog_type')

##### Test

In [None]:
# Check sample to see the results
with_stage.sample(5)

In [None]:
# Check to see if there's repetition with the name since the indices are different
with_stage[(with_stage.name == 'Rinna')]

##### Code

In [None]:
# Drop the rows with None since those are just duplicates, and then drop the dog stage
with_stage = with_stage[with_stage.dog_type != "None"]
with_stage = with_stage.drop('dog_stage', axis=1)

##### Test

In [None]:
# Check sample to ensure it worked out
with_stage.sample(5)

##### Code

In [None]:
# Concatenate the with and without stage dataframes back into a joined dataframe
enhanced_df_clean = pd.concat([with_stage, without_stage], join='outer', sort=False)

##### Test

In [None]:
# Check to see if it concatenated correctly
enhanced_df_clean.sample(10)

In [None]:
# Check to see if all the dog stages are still present
enhanced_df_clean.dog_type.value_counts()

In [None]:
# The following 4 are to ensure the cleaned dataframes dog stages match the original dataframe
enhanced_df.pupper.value_counts()

In [None]:
enhanced_df.doggo.value_counts()

In [None]:
enhanced_df.puppo.value_counts()

In [None]:
enhanced_df.floofer.value_counts()

In [None]:
enhanced_df_clean.dog_type.isna().sum()

In [None]:
enhanced_df_clean.tweet_id.duplicated().sum()

In [None]:
# enhanced_df_clean.tweet_id.duplicated().index
enhanced_df_clean[enhanced_df_clean.tweet_id.duplicated()]

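#### Column name not clear (`id` should be `tweet_id` to match the other tables) in `tweet_df` table
##### Define
Rename the `id` column in `tweet_df_clean` to `tweet_id` so that all three tables share the same key name.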
##### Code

In [None]:
tweet_df_clean.rename(columns={'id':'tweet_id'}, inplace=True)

##### Test

In [None]:
list(tweet_df_clean)

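#### All tables can be combined into one on `tweet_id`
##### Define
Merge the three cleaned tables with inner joins on `tweet_id`, so that only tweets present in all three remain.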
##### Code

In [None]:
# Merge the three dataframes to create a master dataframe
twitter_archive_master = enhanced_df_clean.merge(predictions_df_clean, how='inner', on='tweet_id').merge(tweet_df_clean, how='inner', on='tweet_id')

##### Test

In [None]:
list(twitter_archive_master)
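
As a hedged sketch of how the merge could guard against unexpected duplicates, `DataFrame.merge` accepts a `validate` argument that raises an error when the join keys are not unique. The three toy frames below are invented and only mimic the shape of the real tables.

```python
import pandas as pd

# Invented stand-ins for the three cleaned tables
archive = pd.DataFrame({"tweet_id": [1, 2, 3], "rating_numerator": [13, 12, 14]})
predictions = pd.DataFrame({"tweet_id": [1, 2], "p1": ["Pug", "Chow"]})
api = pd.DataFrame({"tweet_id": [2, 3], "favorite_count": [100, 200]})

# validate="one_to_one" raises pandas.errors.MergeError if tweet_id repeats
master = (
    archive
    .merge(predictions, how="inner", on="tweet_id", validate="one_to_one")
    .merge(api, how="inner", on="tweet_id", validate="one_to_one")
)

print(master)
```

Only tweet 2 appears in all three toy tables, so the inner joins keep a single row. On the real data, a failure of the `one_to_one` check would point at repeated `tweet_id` values that need resolving before the merge.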

### Assess - Twitter Archive Master

In [None]:
# Create a copy to continue cleaning the data
twitter_master_copy = twitter_archive_master.copy()

In [None]:
# Count the rows without a dog stage
twitter_master_copy.dog_type.isna().sum()

### Quality

Duplicated rows after melting

In [None]:
# enhanced_df_clean.tweet_id.duplicated().index
duplicate_dogs = twitter_archive_master[twitter_archive_master.tweet_id.duplicated()]
duplicate_dogs

In [None]:
twitter_archive_master.tweet_id = twitter_archive_master.tweet_id.astype('str')

In [None]:
# twitter_archive_master[twitter_archive_master['tweet_id'] == '775898661951791106']

In [None]:
# Visually assess the duplicates to see if they can be classified and kept
dog_list = list(duplicate_dogs.index)

for x in dog_list:
    print('>', twitter_archive_master.tweet_id[x], '-', twitter_archive_master.text[x], '\n')

In [None]:
twitter_archive_master.drop
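
One possible way to resolve these duplicates, offered as a sketch rather than as the approach this notebook settles on, is to collapse the rows for each tweet into a single combined stage label. The frame `melted` is invented toy data.

```python
import pandas as pd

# Invented example: tweet 10 was tagged with two stages, so it appears twice
melted = pd.DataFrame({
    "tweet_id": [10, 10, 11],
    "dog_type": ["doggo", "pupper", "puppo"],
})

# Join all stages of a tweet into one comma-separated label
stages = (
    melted.groupby("tweet_id")["dog_type"]
    .agg(lambda s: ", ".join(sorted(s)))
    .reset_index()
)

print(stages)
```

Whether a combined label such as 'doggo, pupper' is acceptable depends on the tweet text, which is why the visual check above still matters.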

#### 1. Missing photo URLs for tweets (`expanded_urls`) in the `enhanced_df` table
##### Define

Since we only want original ratings that have images, the tweets that are missing photo URLs will be dropped. 

##### Code

In [None]:
# Drop the rows (not just the column values) that have no photo URL
enhanced_df_clean = enhanced_df_clean.dropna(subset=['expanded_urls'])

##### Test

In [None]:
enhanced_df_clean.expanded_urls.isnull().sum()

#### 2. Erroneous datatype (`timestamp` and `retweeted_status_timestamp` are strings) in the `enhanced_df` table
##### Define
The `timestamp` and `retweeted_status_timestamp` columns in the `enhanced_df` table are stored as strings, but they should be datetimes.

##### Code

In [None]:
# Datetime information found here: https://stackoverflow.com/questions/17134716/convert-dataframe-column-type-from-string-to-datetime
enhanced_df_clean.timestamp = pd.to_datetime(enhanced_df_clean.timestamp)
# retweeted_status_timestamp may already have been dropped with the other retweet columns
if 'retweeted_status_timestamp' in enhanced_df_clean.columns:
    enhanced_df_clean.retweeted_status_timestamp = pd.to_datetime(enhanced_df_clean.retweeted_status_timestamp)

##### Test

In [None]:
enhanced_df_clean.info()
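
To illustrate the conversion on its own, here is a small sketch using two timestamp strings in the archive's format. Passing `utc=True` is a choice made for this example so that the result is explicitly timezone-aware.

```python
import pandas as pd

# Two timestamps in the same format as the archive's timestamp column
raw = pd.Series(["2017-08-01 16:23:56 +0000", "2017-07-31 00:18:03 +0000"])

# Parse the strings into timezone-aware datetimes
parsed = pd.to_datetime(raw, utc=True)

print(parsed.dtype)
print(parsed.dt.year.tolist())
```

Once the column is a datetime, accessors such as `.dt.year` or `.dt.hour` become available for grouping and plotting.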

#### 3. Erroneous dog names ('a', 'an', 'the', 'such', 'his', 'quite' and None are not real names) in the `enhanced_df` table
##### Define
Since 'a', 'an', 'the', 'such', 'his', 'quite' and None are not names, we replace them with NaN to avoid confusion and to mark the name as not provided.

##### Code

In [None]:
# https://stackoverflow.com/questions/17097236/replace-invalid-values-with-none-in-pandas-dataframe
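# replace() returns a new Series, so the result must be assigned back
enhanced_df_clean.name = enhanced_df_clean.name.replace(['a','an','the','such','his','None','quite'], np.nan)

##### Test

In [None]:
# The replaced values should no longer appear in the counts
enhanced_df_clean.name.value_counts()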

#### 4. Retweet information is not needed, so items that are not NaN in `retweeted_status_id` can be removed
##### Define
Keep only the rows where `retweeted_status_id` is NaN.

##### Code

In [None]:
enhanced_df_clean.head()

In [None]:
# NaN never compares equal to itself, so select missing values with isna()
enhanced_df_clean = enhanced_df_clean[enhanced_df_clean.retweeted_status_id.isna()]


##### Test

In [None]:
# Count the retweets that remain (expected: 0)
enhanced_df_clean.retweeted_status_id.notnull().sum()
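
Comparing a column against `np.nan` with `==` never matches, because NaN is not equal to anything, including itself; `isna()` is the usual way to select missing values. A small sketch on an invented Series:

```python
import numpy as np
import pandas as pd

# Invented IDs: NaN marks an original tweet, a number marks a retweet
ids = pd.Series([np.nan, 8.87e17, np.nan])

by_equality = int((ids == np.nan).sum())  # equality with NaN is always False
by_isna = int(ids.isna().sum())           # isna() finds the missing values

# Boolean mask that keeps only the original tweets
originals = ids[ids.isna()]

print(by_equality, by_isna, len(originals))
```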

#### 5. About half of `p1`, `p2`, and `p3` are not capitalized in the `predictions_df` table
##### Define

Capitalize all data in `p1`, `p2`, and `p3` within the `predictions_df_clean` table.

##### Code

In [None]:
predictions_df_clean.p1 = predictions_df_clean.p1.str.capitalize()
predictions_df_clean.p2 = predictions_df_clean.p2.str.capitalize()
predictions_df_clean.p3 = predictions_df_clean.p3.str.capitalize()

##### Test

In [None]:
predictions_df_clean.sample(5)
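
Note that `str.capitalize()` uppercases only the first character and lowercases the rest. If more readable labels are wanted, one alternative is sketched below. The sample labels are invented, and the underscore-separated format is an assumption about how the predictions are written.

```python
import pandas as pd

# Invented labels; the underscore format is an assumption
labels = pd.Series(["golden_retriever", "Labrador_retriever", "pug"])

# capitalize(): first letter upper, everything else lower
capitalized = labels.str.capitalize()

# Alternative: swap underscores for spaces and title-case each word
readable = labels.str.replace("_", " ", regex=False).str.title()

print(capitalized.tolist())
print(readable.tolist())
```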

#### 5. Duplicated images/rows in the `predictions_df` table
##### Define
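Identify the rows in the `predictions_df` table that share the same image (`jpg_url`) and keep only the first row of each duplicate set.

##### Code

In [None]:
# Keep the first row for each image URL and drop the later duplicates
predictions_df_clean = predictions_df_clean.drop_duplicates(subset='jpg_url', keep='first')

##### Test

In [None]:
# No duplicated image URLs should remain
predictions_df_clean.jpg_url.duplicated().sum()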

#### 6. Missing data for entire column (`contributors`,`coordinates`,`geo`) in `tweet_df` table
##### Define
Drop `contributors`, `coordinates`, and `geo` columns in `tweet_df` table

##### Code

In [None]:
# Dropping columns correctly: https://stackoverflow.com/questions/21457917/pandas-dataframe-dropped-column-appearing-again
tweet_df_clean.drop(columns=['contributors', 'coordinates', 'geo'], axis=1, inplace=True)

##### Test

In [None]:
# Confirm columns are gone
list(tweet_df_clean)
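
Instead of naming the empty columns by hand, they can also be detected. The sketch below uses an invented toy frame to show `dropna(axis=1, how='all')`, which removes only the columns that contain no data at all.

```python
import numpy as np
import pandas as pd

# Invented toy frame: geo is entirely empty, place is only mostly empty
toy = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "geo": [np.nan, np.nan, np.nan],
    "place": [np.nan, "somewhere", np.nan],
})

# List the columns in which every value is missing
empty_cols = toy.columns[toy.isna().all()].tolist()

# Drop exactly those columns; partially filled columns are kept
trimmed = toy.dropna(axis=1, how="all")

print(empty_cols)
print(list(trimmed.columns))
```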

#### 7. Missing data for almost entire column (`in_reply_to_screen_name`, `in_reply_to_status_id`, `in_reply_to_status_id_str`, `in_reply_to_user_id`, `in_reply_to_user_id_str`, `place`, `quoted_status`, `quoted_status_id`, `quoted_status_id_str`, `quoted_status_permalink`, `retweeted_status`) in `tweet_df` table
##### Define
Since these columns do not play a major role and are not relevant for the analyses and visualizations that we will conduct, they will be dropped as well. 

##### Code

In [None]:
# Drop the mostly empty columns (each column listed once)
tweet_df_clean.drop(columns=[
    'in_reply_to_screen_name', 'in_reply_to_status_id', 'in_reply_to_status_id_str',
    'in_reply_to_user_id', 'in_reply_to_user_id_str', 'place',
    'quoted_status', 'quoted_status_id', 'quoted_status_id_str',
    'quoted_status_permalink', 'retweeted_status'], inplace=True)

##### Test

In [None]:
# Confirm columns are gone
list(tweet_df_clean)

#### 8. Column name not clear(`id` should be `tweet_id` to match the other tables) in `tweet_df` table
##### Define

##### Code

In [None]:
tweet_df_clean.rename(columns={'id':'tweet_id'}, inplace=True)

##### Test

In [None]:
tweet_df_clean.info()

#### 9. Single variable `dog_stage` split up into four columns in `enhanced_df` table
##### Define


##### Code

In [None]:
enhanced_df_clean

In [None]:
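# errors='ignore' keeps this cell from failing if the columns were already dropped earlier
enhanced_df_clean.drop(columns=['in_reply_to_status_id', 'in_reply_to_user_id', 'retweeted_status_id', 'retweeted_status_user_id', 'retweeted_status_timestamp'], errors='ignore', inplace=True)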


In [None]:
enhanced_df_clean

In [None]:
enhanced_df_clean = pd.melt(enhanced_df_clean, id_vars=['tweet_id', 'timestamp', 'source', 'text', 'expanded_urls', 'rating_numerator', 'rating_denominator', 'name'],
                           var_name='dog_stage', value_name='dog_type')
enhanced_df_clean = enhanced_df_clean[enhanced_df_clean.dog_type != "None"]
# enhanced_df_clean = enhanced_df_clean.drop('dog_type', axis=1)

In [None]:
# dog_stages = pd.melt(enhanced_df_clean, 
#                             id_vars=['tweet_id'],
#                             value_vars = ['doggo','floofer','pupper','puppo'],
#                             value_name = 'dog_stage')

##### Test

In [None]:
enhanced_df_clean

In [None]:
# dog_stages.tweet_id.duplicated()
enhanced_df_clean.tweet_id.duplicated().sum()

#### 10. All tables can be combined into one on `tweet_id`
##### Define


##### Code

##### Test

<a id='sav'></a>
## Store, Analyze & Visualize

Store the clean DataFrame(s) in a CSV file with the main one named twitter_archive_master.csv. If additional files exist because multiple tables are required for tidiness, name these files appropriately. Additionally, you may store the cleaned data in a SQLite database (which is to be submitted as well if you do).

Analyze and visualize your wrangled data in your wrangle_act.ipynb Jupyter Notebook. At least three (3) insights and one (1) visualization must be produced.
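
A minimal sketch of the storing step, assuming a small invented frame in place of the real master table. The CSV file name comes from the instructions above; the SQLite file name, the table name `master`, and the temporary directory are hypothetical choices for this example.

```python
import os
import sqlite3
import tempfile

import pandas as pd

# Invented stand-in for the cleaned master dataframe
master = pd.DataFrame({"tweet_id": ["1", "2"], "rating_numerator": [13, 12]})

# Write into a temporary directory so the example leaves no files behind in the project
out_dir = tempfile.mkdtemp()
csv_path = os.path.join(out_dir, "twitter_archive_master.csv")
db_path = os.path.join(out_dir, "twitter_archive_master.db")  # hypothetical name

# index=False avoids writing the row index as an extra unnamed column
master.to_csv(csv_path, index=False)

# Optional: store the same table in a SQLite database
conn = sqlite3.connect(db_path)
master.to_sql("master", conn, index=False, if_exists="replace")
row_count = conn.execute("SELECT COUNT(*) FROM master").fetchone()[0]
conn.close()

# Read the CSV back, keeping tweet_id as a string
reloaded = pd.read_csv(csv_path, dtype={"tweet_id": str})

print(row_count, reloaded.shape)
```

Reading `tweet_id` back as a string avoids treating the identifier as a number, which matches the earlier conversion of `tweet_id` to `str`.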

<a id='conclusion'></a>
## Conclusion

>Create a 300-600 word written report called wrangle_report.pdf or wrangle_report.html that briefly describes your wrangling efforts. This is to be framed as an internal document.

>Create a 250-word-minimum written report called act_report.pdf or act_report.html that communicates the insights and displays the visualization(s) produced from your wrangled data. This is to be framed as an external document, like a blog post or magazine article, for example.

Throughout this notebook, we can see... 

A thorough explanation of .. can be found in `wrangle_report.pdf` and more detail into .. can be found in `act_report.pdf`. 