# Project IV - Wrangle and Analyze Data

## Table of Contents
- [Introduction](#intro)
- [Part I - Gathering Data](#gathering)
- [Part II - Assessing Data](#assessing)
- [Part III - Cleaning Data](#cleaning)
- [Part IV - Analyzing Data](#analyzing)


<a id='intro'></a>
### (I)- Introduction

The dataset that we will be wrangling (and analyzing and visualizing) is the tweet archive of Twitter user @dog_rates, also known as WeRateDogs. WeRateDogs is a Twitter account that rates people's dogs with a humorous comment about the dog. These ratings almost always have a denominator of 10. The numerators, though? Almost always greater than 10. 11/10, 12/10, 13/10, etc. Why? Because "they're good dogs Brent." WeRateDogs has over 4 million followers and has received international media coverage.

**Our goal:** wrangle WeRateDogs Twitter data to create interesting and trustworthy analyses and visualizations.

</br>

<a id='gathering'></a>
### (II)- Gathering Data

We will gather data from three different sources in three different formats. The three pieces of data are the following:

1) **WeRateDogs Twitter Archive**- This file is provided to us. We will download it manually and load it into our project. The file is in `.csv` format and is named `twitter_archive_enhanced.csv`. It contains basic information about the tweets, such as tweet IDs, timestamps, sources, dog names, etc.

2) **Tweet Image Predictions File**- This file describes what breed of dog (or other object, animal, etc.) is present in each tweet according to a neural network. The file is in `.tsv` format and is named `image_predictions.tsv`. It is hosted on Udacity's servers. We will download it programmatically using the Requests library from the following URL: https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv

3) **Retweet and Favorite Count File**- The `twitter_archive_enhanced.csv` file does not include two important attributes: the retweet count and the favorite count of each tweet. To get them, we will use Python's Tweepy library. Using the tweet IDs in the WeRateDogs Twitter archive, we will query the Twitter API for each tweet's JSON data and store each tweet's entire set of JSON data in a file called `tweet_json.txt`, with each tweet's JSON data written on its own line. We will then read this `.txt` file line by line into a pandas DataFrame with (at minimum) tweet ID, retweet count, and favorite count.
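The "one JSON object per line" storage format described in item 3 can be tried out without Twitter credentials. The following is a minimal sketch: the `sample_tweets` records and the file name `tweet_json_demo.txt` are made-up stand-ins, not real API output.

```python
import json
import os
import tempfile

import pandas as pd

# Hypothetical stand-ins for the JSON payloads returned by the Twitter API.
sample_tweets = [
    {"id": 1, "retweet_count": 10, "favorite_count": 100, "full_text": "a"},
    {"id": 2, "retweet_count": 20, "favorite_count": 200, "full_text": "b"},
]

# Write each record on its own line.
demo_path = os.path.join(tempfile.mkdtemp(), "tweet_json_demo.txt")
with open(demo_path, "w") as f:
    for tweet in sample_tweets:
        f.write(json.dumps(tweet) + "\n")

# Read the file back line by line, keeping only the fields of interest.
rows = []
with open(demo_path) as f:
    for line in f:
        tweet = json.loads(line)
        rows.append({
            "tweet_id": tweet["id"],
            "retweet_count": tweet["retweet_count"],
            "favorite_count": tweet["favorite_count"],
        })

counts_demo = pd.DataFrame(rows)
```

Because every line is a complete JSON document, a partially written file can still be read up to the last complete line.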

</br>

In [None]:
#Import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import requests
import tweepy 
import json
import time
import re

</br>

**For Data File: `twitter_archive_enhanced.csv`**

In [None]:
#reading archive file 'twitter_archive_enhanced.csv'
twitter_archive=pd.read_csv('twitter_archive_enhanced.csv')
twitter_archive.head()

</br>

</br>

**For Data File: `image_predictions.tsv`**

In [None]:
#downloading the file's content programmatically and saving it into a file 'image_predictions.tsv'.
url='https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
response=requests.get(url)

with open ('./image_predictions.tsv', 'wb') as file:
    file.write(response.content)

In [None]:
#fetching the content of the file in a dataframe
image_predictions=pd.read_csv('image_predictions.tsv',sep='\t')
image_predictions.head()

</br>

</br>

**For Data File: `tweet_json.txt`**

In [None]:
#Gathering data from Twitter using the Tweepy library and saving it in the tweet_json.txt file.


#1) setting up tweepy object

consumer_key = ''
consumer_secret = ''
access_token = ''
access_secret = ''

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

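#wait_on_rate_limit is an option of the API object: tweepy then sleeps until the rate limit resets
api = tweepy.API(auth, wait_on_rate_limit=True)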

In [None]:
#2) collecting tweet IDs from archive and putting them into a list 'tweet_ids' 

tweet_ids=list(twitter_archive['tweet_id'])
tweet_ids

In [None]:
#3) now fetching the two missing attributes corresponding to the tweet ids


#list for tweet ids which will not be accessible.
errorneous_ids=[]

with open('tweet_json.txt','w') as file:

    for _id in tweet_ids:

        try:
            
            #fetching data corresponding to tweet ID
            #(rate-limit waiting is an option of tweepy.API, not of get_status)
            t_data=api.get_status(_id,tweet_mode='extended')

            #t_data is not in JSON serializable form, converting it into this format
            json_content=t_data._json

            #putting json content in the text file
            json.dump(json_content,file)
            file.write('\n')
        
        except Exception as e:
                
            #saving the id in the list 'errorneous_ids'
            errorneous_ids.append(_id)

#displaying ids which were not accessible.
errorneous_ids

In [None]:
#4) Reading the json content from the 'tweet_json.txt' file and extracting the retweet and favorite counts for each tweet id


#list for saving dictionaries
dict_list=[]


with open('tweet_json.txt','r') as file:
    for tweet_data in file:
        
        #converting tweet_data in dict format:
        tweet_data=json.loads(tweet_data)
        
        #creating dictionary,saving both attributes  in it and appending it in 'dict_list'
        dict_list.append({
            
                        'tweet_id': tweet_data['id'],
                        'retweet_count': tweet_data['retweet_count'],
                        'favorite_count': tweet_data['favorite_count'],
                         
                         })
        

    

dict_list

In [None]:
#5) converting the list of dictionaries into a dataframe

retweet_fav_counts=pd.DataFrame(dict_list, columns=['tweet_id','favorite_count','retweet_count'])
retweet_fav_counts.head()

</br>

</br>

<a id='assessing'></a>
### (III)- Assessing Data

Assessment of data can be done in two ways: 1) Visually 2) Programmatically.

We will be observing our data in both ways.
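As an illustration of what a programmatic assessment can look like, here is a small sketch that bundles a few common checks into one helper. The function name `assess` and the toy dataframe are hypothetical and only serve the example.

```python
import pandas as pd


def assess(df, id_column):
    """Return a few programmatic checks for a dataframe as a dict."""
    return {
        "rows": len(df),
        "duplicated_ids": int(df[id_column].duplicated().sum()),
        "nulls_per_column": df.isnull().sum().to_dict(),
    }


# Toy data with one duplicated id and one missing name.
assess_demo = pd.DataFrame({
    "tweet_id": [1, 2, 2],
    "name": ["Phineas", None, "a"],
})
summary = assess(assess_demo, "tweet_id")
```

Visual assessment remains useful for issues that such counts cannot reveal, for example implausible names.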

</br>

**Assessing `twitter_archive` dataframe.**

In [None]:
#visual Assessment
twitter_archive

In [None]:
#checking columns names, their datatype and count of non-null values
twitter_archive.info()

**Columns Description:**

*    **tweet_id:** the unique identifier for each tweet.
*    **in_reply_to_status_id:** if the represented Tweet is a reply, this field will contain the integer representation of the original Tweet’s ID
*    **in_reply_to_user_id:** if the represented Tweet is a reply, this field will contain the integer representation of the original Tweet’s author ID
*    **timestamp:** time when tweet was created
*    **source:** utility used to post the Tweet, as an HTML-formatted string. e.g. Twitter for Android, Twitter for iPhone, Twitter Web Client
*    **text:** actual UTF-8 text of the status update
*    **retweeted_status_id:** if the represented Tweet is a retweet, this field will contain the integer representation of the original Tweet’s ID
*    **retweeted_status_user_id:** if the represented Tweet is a retweet, this field will contain the integer representation of the original Tweet’s author ID
*    **retweeted_status_timestamp:** time of retweet
*    **expanded_urls:** tweet URL
*    **rating_numerator:** numerator of the rating of a dog.
*    **rating_denominator:** denominator of the rating of a dog.
*    **doggo:** one of the 4 dog "stages"
*    **floofer:** one of the 4 dog "stages"
*    **pupper:** one of the 4 dog "stages"
*    **puppo:** one of the 4 dog "stages"

In [None]:
#checking properties of the table
twitter_archive.describe()

In [None]:
#checking if any duplicated tweet ID is present in the dataset
twitter_archive.tweet_id.duplicated().any()

In [None]:
#Finding tweets which are retweeted and their count
retweets=twitter_archive[~twitter_archive.retweeted_status_id.isnull()]
retweets

In [None]:
#counting no. of retweets which are in the dataframe 
retweets.shape[0]

In [None]:
#checking values of rating_numerator feature and their counts
twitter_archive.rating_numerator.value_counts().sort_index()

In [None]:
#checking values of the rating_denominator feature and their counts
twitter_archive.rating_denominator.value_counts().sort_index()

In [None]:
#checking the names and their counts
twitter_archive.name.value_counts()

In [None]:
#checking doggo columns values and their counts
twitter_archive.doggo.value_counts()

In [None]:
#checking floofer columns values and their counts
twitter_archive.floofer.value_counts()

In [None]:
#checking pupper columns values and their counts
twitter_archive.pupper.value_counts()

In [None]:
#checking puppo columns values and their counts
twitter_archive.puppo.value_counts()

In [None]:
#rows in which at least one dog "stage" is present
dog_stages=twitter_archive[~(twitter_archive.doggo=='None') | ~(twitter_archive.pupper=='None') 
                | ~(twitter_archive.puppo=='None') |~(twitter_archive.floofer=='None')].iloc[:,[0,13,14,15,16]]

dog_stages

In [None]:
#rows where multiple dog "stages" are present

#replacing 'None' with null values in the four stage columns
stage_cols=['doggo','floofer','pupper','puppo']
dog_stages=dog_stages.copy()
dog_stages[stage_cols]=dog_stages[stage_cols].replace('None',np.nan)

#keeping only the rows with at least 3 non-null values (tweet_id plus two or more stages)
dog_stages=dog_stages.dropna(axis=0,thresh=3)
dog_stages.iloc[:,[0,1,2,3,4]]
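The same question (which tweets carry more than one dog stage?) can also be answered by counting the non-'None' stage columns per row. This is a sketch on a made-up three-row dataframe, not on the real archive.

```python
import pandas as pd

stage_columns = ["doggo", "floofer", "pupper", "puppo"]

# Toy data: only tweet 3 has two stages.
stages_demo = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "doggo": ["doggo", "None", "doggo"],
    "floofer": ["None", "None", "None"],
    "pupper": ["None", "pupper", "pupper"],
    "puppo": ["None", "None", "None"],
})

# Count, per row, how many stage columns hold something other than 'None'.
stage_count = (stages_demo[stage_columns] != "None").sum(axis=1)
multiple_stages = stages_demo[stage_count > 1]
```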

</br>

**Assessing `image_predictions` dataframe.**

In [None]:
#visual assessment
image_predictions

In [None]:
#checking columns names, their datatype and count of non-null values
image_predictions.info()

*    **tweet_id:** is the last part of the tweet URL after "status"
*    **p1:** is the algorithm's #1 prediction for the image in the tweet → golden retriever
*    **p1_conf:** is how confident the algorithm is in its #1 prediction → 95%
*    **p1_dog:** is whether or not the #1 prediction is a breed of dog → TRUE
*    **p2:** is the algorithm's second most likely prediction → Labrador retriever
*    **p2_conf:** is how confident the algorithm is in its #2 prediction → 1%
*    **p2_dog:** is whether or not the #2 prediction is a breed of dog → TRUE

In [None]:
#checking properties of the dataframe
image_predictions.describe()

In [None]:
#checking values of p1 and counts
image_predictions.p1.value_counts()

In [None]:
#checking values of p2 and counts
image_predictions.p2.value_counts()

In [None]:
#checking values of p3 and counts
image_predictions.p3.value_counts()

In [None]:
#checking for duplicated urls
with pd.option_context('max_colwidth',200):
    display(image_predictions[image_predictions['jpg_url'].duplicated(keep=False)].sort_values('jpg_url'))

>Note:
Although the result above contains duplicated URLs, we won't delete them because they belong to different tweet IDs. One tweet of each pair could be the original tweet and the other a retweet.

</br>

</br>

**Assessing `retweet_fav_counts` dataframe.**

In [None]:
#visual assessment
retweet_fav_counts

In [None]:
#checking columns names, their datatype and count of non-null values
retweet_fav_counts.info()

In [None]:
#checking properties of the dataframe
retweet_fav_counts.describe()

</br>

</br>

**Quality Issues:**

**1) `twitter_archive` dataframe:**

  a) Retweets need to be deleted.
 
  b) Features which are not required can be removed.

  c) The data type of these columns needs to be changed: 
  *  `tweet_id`: int->str, 
  * `in_reply_to_status_id`: float->str, 
  * `in_reply_to_user_id`: float->str, 
  * `timestamp`: object->datetime , 
  * `rating_numerator`: int->float.

 
  d) Ratings given with decimal values were not extracted correctly.

  e) In some instances, when two fractions (#/#) appear in the text, the first fraction was taken as the rating although the actual rating is the second fraction.

  f) Erroneous names starting with lowercase letters (for example: a, an, officially) need to be removed and set to 'None'.

  g) Sources are difficult to read.
  
  h) 'None' strings are used in the dataframe instead of null values.
  
  


**2) `image_predictions` dataframe:**

  a) Some of the values in the p1, p2 and p3 features start with an uppercase letter and others with a lowercase letter.

</br>

</br>

**Tidiness Issues:**

**1)**   Merge the dog "stages" columns into one column.

**2)**   Add 'retweet_count' and 'favorite_count' features from `retweet_fav_counts` dataframe to `twitter_archive` dataframe.

**3)**   Add prediction data from `image_predictions` dataframe to `twitter_archive` dataframe. 

</br>

<a id='cleaning'></a>
### (IV)- Cleaning Data

In [None]:
#creating copies of dataframe
twitter_archive_clean=twitter_archive.copy()
image_predictions_clean=image_predictions.copy()
retweet_fav_counts_clean=retweet_fav_counts.copy()

</br>

**Retweets need to be deleted.**

**_Define:_**

Keep only original tweets in the `twitter_archive_clean` dataframe. Delete retweets.

**_Code:_**

In [None]:
#displaying retweets.
twitter_archive_clean[~twitter_archive_clean.retweeted_status_id.isnull()].head()

In [None]:
#dropping retweets
twitter_archive_clean=twitter_archive_clean[twitter_archive_clean.retweeted_status_id.isnull()]
twitter_archive_clean

**_Test:_**

In [None]:
#displaying retweets again (the result should be empty).
twitter_archive_clean[~twitter_archive_clean.retweeted_status_id.isnull()]

</br>

</br>

**Merge the dog "stages" columns into one column.**

**_Define:_**

Create one column for dog stages that combines the values from the columns 'doggo', 'floofer', 'pupper' and 'puppo', excluding 'None' values.
To do this:
*   Melt the 'doggo', 'floofer', 'pupper', 'puppo' columns.
*   Create a new dataframe 'df_dog_stages' which holds the tweet ids and all the dog stages associated with each of them.
*   Drop the 'dog_stages' column present in the 'twitter_archive_clean' dataframe.
*   Merge the 'df_dog_stages' dataframe with the 'twitter_archive_clean' dataframe and remove duplicates.
*   If multiple dog stages are present for a tweet, check the text of that tweet and choose the appropriate dog stage. Then correct it manually.
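For comparison, the combined stage column can also be built row by row without melting. The sketch below is an alternative to the steps listed above, not the approach used in this notebook; the helper `join_stages` and the toy dataframe are hypothetical.

```python
import pandas as pd

stage_columns = ["doggo", "floofer", "pupper", "puppo"]

# Toy data: one stage, no stage, and two stages.
combine_demo = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "doggo": ["doggo", "None", "doggo"],
    "floofer": ["None", "None", "None"],
    "pupper": ["None", "None", "pupper"],
    "puppo": ["None", "None", "None"],
})


def join_stages(row):
    """Join all stage values of a row that are not 'None'."""
    stages = [value for value in row if value != "None"]
    return ", ".join(stages) if stages else "None"


combine_demo["dog_stages"] = combine_demo[stage_columns].apply(join_stages, axis=1)
```

Rows with several stages end up with a comma-separated value such as 'doggo, pupper', which still has to be resolved by reading the tweet text.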

**_Code:_**

In [None]:
#let's check what is the shape of 'twitter_archive_clean' dataframe at this point
twitter_archive_clean.shape

In [None]:
# 1) Melting four columns 'doggo','floofer','pupper','puppo'


all_columns=list(twitter_archive_clean)
value_vars=['doggo','floofer','pupper','puppo']
id_vars=[x for x in all_columns if x not in value_vars]

twitter_archive_clean=twitter_archive_clean.melt(id_vars=id_vars,value_vars=value_vars,value_name='dog_stages')

#dropping 'variable' column
twitter_archive_clean.drop(columns='variable',axis=1,inplace=True)

#Let's see what it looks like now
twitter_archive_clean


In [None]:
#2) Creating a dataframe of tweet ids and all the dog stages associated with them, using the group by function.


#grouping by 'tweet_id' column
grouped=twitter_archive_clean.groupby('tweet_id')['dog_stages']

#collecting dog_stage values of tweet ids together and removing 'None' from them
dict1=[]
for tweet_id, list_of_dog_stages in grouped:
    
    str1=""
    
    for stage in list_of_dog_stages:
        if stage!='None':str1=str1+stage+', '

    dict1.append({
        'tweet_id':tweet_id,
        'dog_stages':str1[:-2]
                })
    

#converting the list of dictionaries into a dataframe
df_dog_stages=pd.DataFrame(dict1,columns=['tweet_id','dog_stages'])

#replacing empty strings with 'None'
df_dog_stages.dog_stages=df_dog_stages.dog_stages.replace('','None')

#let's look at the dataframe now
df_dog_stages

In [None]:
#3) Now we have a dataframe with tweet ids and the dog stages associated with them (dataframe: 'df_dog_stages'). We would like to 
#merge it with the dataframe 'twitter_archive_clean'. But before merging both dataframes, we want to delete the 'dog_stages'
#column present in 'twitter_archive_clean' as it is no longer useful for us.


#dropping 'dog_stages' column from 'twitter_archive_clean'
twitter_archive_clean.drop(columns=['dog_stages'],axis=1,inplace=True)
twitter_archive_clean

#now merging both the dataframes on 'tweet_id'
twitter_archive_clean=twitter_archive_clean.merge(df_dog_stages,on='tweet_id',how='inner')

twitter_archive_clean

In [None]:
#4) In 'twitter_archive_clean' there are 4 rows for each 'tweet_id'; eliminating all duplicated rows now
twitter_archive_clean.drop_duplicates(subset=['tweet_id'],inplace=True)
twitter_archive_clean

In [None]:
#5) Now let's have a look at the tweets where multiple dog stages are present. We will check the text of these tweets 
#and choose appropriate dog stages. In case of ambiguous data, we will set it to None.

with pd.option_context('max_colwidth', 200):
    display(twitter_archive_clean[(twitter_archive_clean.dog_stages!='doggo') &\
                      (twitter_archive_clean.dog_stages!='floofer') &\
                      (twitter_archive_clean.dog_stages!='pupper') &\
                      (twitter_archive_clean.dog_stages!='puppo')&\
                      (twitter_archive_clean.dog_stages!='None')].iloc[:,[0,5,13]])

In [None]:
#correcting dog_stages manually
twitter_archive_clean.loc[660,'dog_stages']='puppo'
twitter_archive_clean.loc[688,'dog_stages']='None'   #ambiguous text
twitter_archive_clean.loc[1528,'dog_stages']='pupper'
twitter_archive_clean.loc[1768,'dog_stages']='None'  #ambiguous text
twitter_archive_clean.loc[1868,'dog_stages']='None'  #ambiguous text
twitter_archive_clean.loc[1896,'dog_stages']='pupper'
twitter_archive_clean.loc[2268,'dog_stages']='doggo'
twitter_archive_clean.loc[2372,'dog_stages']='None'  #ambiguous text
twitter_archive_clean.loc[2888,'dog_stages']='None'  #ambiguous text
twitter_archive_clean.loc[3124,'dog_stages']='None'  #ambiguous text
twitter_archive_clean.loc[3540,'dog_stages']='None'  #ambiguous text
twitter_archive_clean.loc[3740,'dog_stages']='None'  #ambiguous text 

**_Test:_**

In [None]:
#let's have a look at the dataframe now
twitter_archive_clean

In [None]:
#Now let's look at columns of 'twitter_archive_clean' dataframe
twitter_archive_clean.columns

In [None]:
#shape of 'twitter_archive_clean' dataframe
twitter_archive_clean.shape

In [None]:
#Now let's look at the values of 'dog_stages' column
twitter_archive_clean.dog_stages.value_counts()

</br>

</br>

**Add two columns, retweet_count and favorite_count, from `retweet_fav_counts` dataframe to `twitter_archive` dataframe.**

**_Define:_**

Merge both the dataframes on 'tweet_id' column.
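Since an inner merge silently drops tweets that are missing from either dataframe, it can help to check first which ids would be lost. The sketch below uses two made-up dataframes (`archive_demo`, `counts_demo`) and pandas' `indicator=True` option of `merge`.

```python
import pandas as pd

# Toy data: tweet 2 has no retweet/favorite counts.
archive_demo = pd.DataFrame({"tweet_id": [1, 2, 3], "text": ["a", "b", "c"]})
counts_demo = pd.DataFrame({
    "tweet_id": [1, 3],
    "retweet_count": [5, 7],
    "favorite_count": [50, 70],
})

# An outer merge with indicator=True adds a '_merge' column that tells
# whether each row was found in the left frame, the right frame, or both.
check = archive_demo.merge(counts_demo, how="outer", on="tweet_id", indicator=True)
unmatched_ids = check.loc[check["_merge"] != "both", "tweet_id"].tolist()

# The inner merge keeps only the tweets present in both dataframes.
merged_demo = archive_demo.merge(counts_demo, how="inner", on="tweet_id")
```

Both key columns should also share the same data type before merging; here they are both integers.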

**_Code:_**

In [None]:
twitter_archive_clean=twitter_archive_clean.merge(retweet_fav_counts_clean,how='inner',on='tweet_id')

**_Test:_**

In [None]:
#After merging, columns present in 'twitter_archive_clean' dataframe
twitter_archive_clean.columns

</br>

</br>

**Add prediction data from `image_predictions` dataframe to `twitter_archive` dataframe.**

**_Define:_**

*   Merge both the dataframes: 'image_predictions' and 'twitter_archive'. 

*   Then for each tweet id, find the breed of the dog and save it into the list 'breed' and the corresponding confidence level in the list 'confidence_level'. 

*   Add both the lists as columns in original dataframe 'twitter_archive'.
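The selection rule described above (take the first of p1, p2, p3 that is a dog breed) can also be expressed in a vectorised way with `numpy.select`. This is a sketch on invented prediction rows, shown as an alternative to a row-by-row function.

```python
import numpy as np
import pandas as pd

# Toy predictions: row 0 -> p1 is a dog, row 1 -> p2 is a dog, row 2 -> no dog.
pred_demo = pd.DataFrame({
    "p1": ["Golden_retriever", "paper_towel", "banana"],
    "p1_conf": [0.9, 0.5, 0.4],
    "p1_dog": [True, False, False],
    "p2": ["Labrador_retriever", "Pug", "orange"],
    "p2_conf": [0.05, 0.3, 0.3],
    "p2_dog": [True, True, False],
    "p3": ["kuvasz", "chihuahua", "lemon"],
    "p3_conf": [0.01, 0.1, 0.2],
    "p3_dog": [True, True, False],
})

predictions = ["p1", "p2", "p3"]

# np.select picks, for each row, the choice of the first condition that is True.
conditions = [pred_demo[p + "_dog"].to_numpy() for p in predictions]
breed_choices = [pred_demo[p].str.lower().to_numpy() for p in predictions]
conf_choices = [pred_demo[p + "_conf"].to_numpy() for p in predictions]

pred_demo["breed"] = np.select(conditions, breed_choices, default="None")
pred_demo["confidence_level"] = np.select(conditions, conf_choices, default=0.0)
```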

**_Code:_**

In [None]:
#merging both the dataframes
twitter_archive_clean=twitter_archive_clean.merge(image_predictions_clean,how='inner',on='tweet_id')
twitter_archive_clean

In [None]:
#Finding breed and confidence level

def breed_and_confidence(row):
    #returning the first prediction (p1, then p2, then p3) that is a dog breed
    for p in ['p1','p2','p3']:
        if row[p+'_dog']:
            return row[p].lower(),row[p+'_conf']
    #none of the three predictions is a dog breed
    return 'None',0

results=[breed_and_confidence(row) for _,row in twitter_archive_clean.iterrows()]

#Adding 'breed' column in the dataframe
twitter_archive_clean['breed']=[breed for breed,_ in results]

#Adding 'confidence_level' column in the dataframe
twitter_archive_clean['confidence_level']=[confidence for _,confidence in results]


**_Test:_**

In [None]:
#Let's look at the dataframe 
twitter_archive_clean

In [None]:
#columns in the dataframe
twitter_archive_clean.columns

</br>

</br>

**Delete extraneous columns.**

**_Define:_**

Delete all the columns which will not be of any use for our exploration. They are:
*   'in_reply_to_status_id', 
*   'in_reply_to_user_id',
*   'retweeted_status_id', 
*   'retweeted_status_user_id',
*   'retweeted_status_timestamp', 
*   'expanded_urls',
*   'jpg_url', 
*   'img_num', 
*   'p1', 
*   'p1_conf', 
*   'p1_dog', 
*   'p2',
*   'p2_conf', 
*   'p2_dog', 
*   'p3', 
*   'p3_conf', 
*   'p3_dog'

**_Code:_**

In [None]:
#columns in 'twitter_archive_clean' dataframe
twitter_archive_clean.columns

In [None]:
#columns which need to be removed
del_columns=['in_reply_to_status_id', 'in_reply_to_user_id',
       'retweeted_status_id', 'retweeted_status_user_id',
       'retweeted_status_timestamp', 'expanded_urls',
        'jpg_url', 'img_num', 'p1', 'p1_conf', 'p1_dog', 'p2',
       'p2_conf', 'p2_dog', 'p3', 'p3_conf', 'p3_dog']

#dropping the columns
twitter_archive_clean.drop(columns=del_columns,axis=1,inplace=True)

**_Test:_**

In [None]:
#displaying columns
twitter_archive_clean.columns

</br>

</br>

**The data type of some columns needs to be changed.**

**_Define:_**

Change the data type of these columns:

*  `tweet_id`: int->str, 
*  `timestamp`: object->datetime, 
*  `rating_numerator`: int->float.

(The other columns listed for this issue in the 'Quality Issues' part have been removed because they are not needed for our analysis.)

**_Code:_**

In [None]:
#displaying datatype of columns
twitter_archive_clean.info()

In [None]:
twitter_archive_clean.tweet_id=twitter_archive_clean.tweet_id.astype(str)
twitter_archive_clean.timestamp= pd.to_datetime(twitter_archive_clean.timestamp)
twitter_archive_clean.rating_numerator=twitter_archive_clean.rating_numerator.astype(float)

**_Test:_**

In [None]:
#let's check the datatype of all the columns now
twitter_archive_clean.info()

</br>

</br>

**Ratings given with decimal values were not extracted correctly.**

**_Define:_**

Manually fix all the records where the rating was given with decimal values in the 'text' column.
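To see why decimal ratings can go wrong, compare an integer-only pattern with one that allows a decimal point in the numerator. The sketch below uses an invented tweet text; the assumption that the archive's ratings were originally extracted with an integer-only pattern is a guess, not something the data states.

```python
import re

# Numerator with an optional decimal part, a slash, then the denominator.
RATING_PATTERN = re.compile(r"(\d+(?:\.\d+)?)/(\d+)")


def extract_rating(text):
    """Return (numerator, denominator) of the first rating in text, or None."""
    match = RATING_PATTERN.search(text)
    if match is None:
        return None
    return float(match.group(1)), int(match.group(2))


# Hypothetical tweet text, not taken from the archive.
sample_text = "This is Bella. She is a very good girl. 13.5/10"

decimal_aware = extract_rating(sample_text)

# An integer-only pattern only sees the digits after the decimal point.
integer_only = re.search(r"(\d+)/(\d+)", sample_text).group(1)
```

Here `decimal_aware` holds the full numerator 13.5, while `integer_only` captures just the trailing "5".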

**_Code:_**

In [None]:
#displaying all the records where rating had been given with decimal values in text
with pd.option_context('max_colwidth',150):
    display(twitter_archive_clean[twitter_archive_clean.text.str.contains(r'\d+\.\d*\/\d+')].iloc[:,[0,3,4,5]])

In [None]:
# now updating 'rating_numerator' column manually.

twitter_archive_clean.loc[(twitter_archive_clean.tweet_id == '883482846933004288'), 'rating_numerator'] = 13.5
twitter_archive_clean.loc[(twitter_archive_clean.tweet_id == '786709082849828864'), 'rating_numerator'] = 9.75
twitter_archive_clean.loc[(twitter_archive_clean.tweet_id == '778027034220126208'), 'rating_numerator'] = 11.27
twitter_archive_clean.loc[(twitter_archive_clean.tweet_id == '680494726643068929'), 'rating_numerator'] = 11.26

**_Test:_**

In [None]:
#displaying all the records where rating had been given with decimal values in text
with pd.option_context('max_colwidth',150):
    display(twitter_archive_clean[twitter_archive_clean.text.str.contains(r'\d+\.\d*\/\d+')].iloc[:,[0,3,4,5]])

</br>

</br>

**In some instances, when two fractions ( # / #) appear in the text, the first fraction was taken as the rating although the actual rating is the second fraction.**


**_Define:_**

Extract the second fraction from the text using a regular expression and the `re.findall()` function, and update the value in the 'rating_numerator' column.
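A small sketch of how `re.findall()` returns every fraction in a text, so that the second one can be picked. The sample text is invented, and `all_fractions` is a hypothetical helper name.

```python
import re

# Numerator with an optional decimal part, a slash, then the denominator.
FRACTION_PATTERN = re.compile(r"(\d+(?:\.\d+)?)/(\d+)")


def all_fractions(text):
    """Return every fraction in text as a (numerator, denominator) tuple."""
    return [(float(n), int(d)) for n, d in FRACTION_PATTERN.findall(text)]


# Hypothetical tweet text where the first fraction is not a rating.
fraction_text = "Meet Sam. He only sleeps 1/2 of the day. 12/10"

fractions = all_fractions(fraction_text)
second_numerator = fractions[1][0]
```

Whether the second fraction really is the rating still has to be confirmed by reading each text, which is why the affected rows are selected by hand.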

**_Code:_**

In [None]:
#let's see which texts contain more than one fraction.
with pd.option_context('max_colwidth',150):
    display(twitter_archive_clean[twitter_archive_clean.text.str.contains(r'\d+\.?\d*\/\d+.*\d+\.?\d*\/\d+')].iloc[:,[0,3,4,5]])

In [None]:
#from above result, saving indexes of the rows where ratings are inaccurate
indexes_needs_fixing=[798,889,923,1326,1970]

#now changing the numerator value for these indexes
for i in indexes_needs_fixing:
    twitter_archive_clean.loc[i,'rating_numerator']=float(re.findall(r"\d+\.?\d*\/\d+\.?\d*\D+(\d+\.?\d*)\/\d+\.?\d*",twitter_archive_clean.loc[i,'text'])[0])
    

**_Test:_**

In [None]:
#let's see the rows again
with pd.option_context('max_colwidth',150):
    display(twitter_archive_clean.loc[indexes_needs_fixing])

</br>

</br>

**Erroneous names starting with lowercase letters (for example: a, an, officially) need to be removed and set to 'None'.**

**_Define:_**

Replace all the names written in lowercase letters with 'None'.

**_Code:_**

In [None]:
twitter_archive_clean.name.value_counts()

In [None]:
#names written entirely in lowercase (a, an, officially, ...) are not real names
lowercase_names=twitter_archive_clean.name==twitter_archive_clean.name.str.lower()
twitter_archive_clean.loc[lowercase_names,'name']='None'

**_Test:_**

In [None]:
twitter_archive_clean.name.value_counts()

</br>

</br>

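**Sources are difficult to read.**

**_Define:_**

Extract the readable source name from the HTML-formatted string in the 'source' column using a regular expression.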
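The source values are HTML anchor tags, so the readable part is the text between `>` and `<`. The sketch below applies a regular expression to a single sample string; the string is assumed to be typical of the 'source' column rather than copied from a specific row.

```python
import re

# Assumed typical value of the 'source' column.
source_html = (
    '<a href="http://twitter.com/download/iphone" '
    'rel="nofollow">Twitter for iPhone</a>'
)

# Capture everything between the end of the opening tag and the closing tag.
match = re.search(r">([^<]*)<", source_html)
source_name = match.group(1) if match else None
```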

**_Code:_**

In [None]:
#let's see how the values in the source column look
with pd.option_context('max_colwidth',150):
    display(twitter_archive_clean.source)

In [None]:
#extract required part from source
twitter_archive_clean['source']=twitter_archive_clean.source.str.extract(r'>(.*)<',expand=False)

**_Test:_**

In [None]:
twitter_archive_clean.source

In [None]:
twitter_archive_clean.source.value_counts()

</br>

</br>

**'None' values are present in the dataframe instead of null**

**_Define:_**

Find the columns where 'None' values are present and replace them with null.

**_Code:_**

In [None]:
#checking count of non-null values in the dataframe
twitter_archive_clean.info()

In [None]:
#checking count of 'None' value in 'name' column
twitter_archive_clean.name.value_counts().head()

In [None]:
#checking count of 'None' value in 'dog_stages' column
twitter_archive_clean.dog_stages.value_counts().head()

In [None]:
#checking count of 'None' value in 'breed' column
twitter_archive_clean.breed.value_counts().head()

In [None]:
#replacing 'None' values with null in all 3 columns
twitter_archive_clean.name=twitter_archive_clean.name.replace('None',np.nan)
twitter_archive_clean.dog_stages=twitter_archive_clean.dog_stages.replace('None',np.nan)
twitter_archive_clean.breed=twitter_archive_clean.breed.replace('None',np.nan)

**_Test:_**

In [None]:
#checking count of non-null values in the dataframe
twitter_archive_clean.info()

### Storing Data

In [None]:
twitter_archive_clean.to_csv('twitter_archive_master.csv', encoding='utf-8', index=False)

</br>

</br>

<a id='analyzing'></a>
### (V)- Analyzing Data

**1) How are the tweets distributed by year and month?**

In [None]:
#grouping the dataset by 'year' and 'month'
grouped=twitter_archive_clean.groupby([twitter_archive_clean.timestamp.dt.year,twitter_archive_clean.timestamp.dt.month])['tweet_id'].count()
grouped

In [None]:
#setting up position values on x and y axis for plotting a graph
y_pos=[y for y in grouped]
x_pos=np.arange(1,len(y_pos)+1)
x_ticks=list(twitter_archive_clean.groupby([twitter_archive_clean.timestamp.dt.year,twitter_archive_clean.timestamp.dt.month]).groups.keys())
x_ticks=[str(x) for x in x_ticks]
x_ticks

In [None]:
# now plotting the graph

plt.figure(figsize=(25,8))
plt.bar(x_pos,y_pos,tick_label=x_ticks)
plt.xlabel('(Year,month)');
plt.ylabel('No. of tweets');
plt.title('Month, Year vs No. of Tweets');

**2) What is the distribution of the dog stages?**

In [None]:
dogstages_count = twitter_archive_clean.dog_stages.value_counts()
dogstages_count

In [None]:
#creating a pie chart.
explode = np.linspace(0,.1,4)
colors = ['#52BE80', '#E59866', '#EC7063','#5DADE2']
dogstages_count.sort_values(ascending=True).plot.pie(legend=True, subplots=True, autopct='%.2f%%', figsize=(8,8), explode=explode,colors = colors);
plt.ylabel('')
plt.title('Distribution of dog stages', weight='bold', fontsize=16);

**3) Which breeds have received the most likes, and how do the retweet counts look for those breeds?**

In [None]:
#grouping by breed and finding total number of likes for each breed.
breed_likes=twitter_archive_clean.groupby('breed')['favorite_count'].sum().sort_values(ascending=False).head(10)

#saving the index of the result
index=breed_likes.index

breed_likes

In [None]:
#grouping by breed and finding total number of retweets for each breed
breed_retweets=twitter_archive_clean.groupby('breed')['retweet_count'].sum().sort_values(ascending=False)

#considering only the breeds which are present in the top 10 most liked breeds
breed_retweets=breed_retweets.loc[index]

breed_retweets

In [None]:
#plotting a graph.

fig=plt.figure()
ax = fig.add_subplot(111) # Creates matplotlib axes
ax2 = ax.twinx() # Creates another axes that shares the same x-axis as ax.
width = 0.4
breed_likes.plot(figsize = (10,6), kind='bar', color='#C0392B', ax=ax, width=width, position=1, title='Popular Breeds: Likes vs. Retweets')
breed_retweets.plot(figsize = (10,6), kind='bar', color='#3498DB', ax=ax2, width=width, position=0)

ax.set_ylabel('Likes')
ax2.set_ylabel('Retweets')

ax.set_xticklabels(index, rotation=60)

h1, l1 = ax.get_legend_handles_labels()
h2, l2 = ax2.get_legend_handles_labels()

plt.legend(h1+h2, l1+l2, loc=1)
plt.show()

**4) Where do the tweets come from?**

In [None]:
source_count=twitter_archive_clean.source.value_counts()
source_count

In [None]:
#creating pie chart

explode = np.linspace(0,.3,3)
colors = ['#52BE80', '#E59866', '#EC7063']
source_count.sort_values(ascending=True).plot.pie(legend=True, subplots=True, autopct='%.2f%%', figsize=(8,8), explode=explode,colors = colors);
plt.ylabel('')
plt.title('Distribution of sources', weight='bold', fontsize=16);

**5) Is there any relationship between likes and retweets?**

In [None]:
twitter_archive_clean.plot(kind = 'scatter', x = 'favorite_count', y = 'retweet_count', alpha = 0.5,figsize=(15,6))
plt.xlabel('Likes')
plt.ylabel('Retweets')
plt.title('Relationship between Retweets & Likes');
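A scatter plot shows the shape of the relationship; a correlation coefficient puts a number on it. The sketch below computes the Pearson correlation with pandas on invented counts, so the resulting value says nothing about the real dataset.

```python
import pandas as pd

# Invented counts for illustration only.
corr_demo = pd.DataFrame({
    "favorite_count": [100, 200, 300, 400],
    "retweet_count": [30, 55, 95, 120],
})

# Series.corr computes the Pearson correlation coefficient by default.
correlation = corr_demo["favorite_count"].corr(corr_demo["retweet_count"])
```

On the cleaned data, the same call would be made on `twitter_archive_clean['favorite_count']` and `twitter_archive_clean['retweet_count']`.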