## Wrangle Report - WeRateDogs Twitter Data
### by Arvind Sharma

### Introduction

This report describes the data wrangling process performed to analyze and produce visualizations and insights from the 'WeRateDogs' data on Twitter. [WeRateDogs](https://twitter.com/dog_rates) is a Twitter account that rates people's dogs with a humorous comment about each dog. WeRateDogs has over 7 million followers to date and has received international media coverage. But first, let's define data wrangling.

According to Wikipedia, data wrangling, sometimes referred to as data munging, is the process of transforming and mapping data from one "raw" data form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes such as analytics.

The data wrangling process consisted of three stages:

1. Gather data
2. Assess data
3. Clean data

### Gathering data

In this case, we had to collect data from three sources:

1. A file at hand that was made available by Udacity and downloaded manually. This is the Twitter archive for the 'WeRateDogs' account, provided in CSV format as 'twitter-archive-enhanced.csv'. It holds the bulk of the data about the account's tweets from 2015 to 2017. We used the pd.read_csv() function (where pd is the alias for the pandas library) to read the file into a pandas dataframe. Code snippet provided below:
    - ```
        df_twitter_archive = pd.read_csv('twitter-archive-enhanced.csv', encoding='UTF-8')
      ```
<br><br>
2. A file that had to be downloaded programmatically from Udacity's servers. This is the tweet image predictions file, which holds the predictions of a machine learning algorithm applied to the images in WeRateDogs tweets. We used the Requests library to fetch the file into a response object and write its content to 'image_predictions.tsv' on the local file system, then used pd.read_csv() again to read the file into a pandas dataframe. Code snippet provided below:
    - ```
        import requests

        r = requests.get(url)  # url: download link for the file on Udacity's servers
        with open('image_predictions.tsv', 'wb') as file:
            file.write(r.content)
        
        df_image_predications = pd.read_csv('image_predictions.tsv', sep='\t')
      ```
<br><br>
3. Additional data for each tweet, queried from the Twitter API as JSON using Python's Tweepy library and the tweet IDs found in the file at hand (the archive from step 1). Tweepy is an easy-to-use Python library that connects to the Twitter API on behalf of an account using API keys and access tokens. Once authenticated, one can easily retrieve tweets from Twitter. Code snippet provided below:
    - ```
        import config
        import tweepy
        from twython import Twython, TwythonError

        auth = tweepy.OAuthHandler(config.api_key, config.api_secret)
        auth.set_access_token(config.access_token, config.token_secret)
        api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)
      ```
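
Once the `api` object exists, the JSON for each tweet can be collected in a loop. The sketch below is one possible way to do that and is not necessarily the code used in this project: the helper names `fetch_tweet_json` and `read_tweet_counts` and the file name 'tweet_json.txt' are assumptions made for illustration. It relies on Tweepy's `get_status()` call and on the `_json` attribute of the returned status object. Tweets deleted since the archive was created raise an error, so failed IDs are collected instead of aborting the run.

```python
import json


def fetch_tweet_json(api, tweet_ids, out_path="tweet_json.txt"):
    """Write one JSON object per line for every tweet that can be fetched.

    `api` is expected to behave like an authenticated tweepy.API object.
    Returns a list of (tweet_id, error message) pairs for tweets that failed.
    """
    failed = []
    with open(out_path, "w", encoding="utf-8") as outfile:
        for tweet_id in tweet_ids:
            try:
                status = api.get_status(tweet_id, tweet_mode="extended")
            except Exception as err:
                # Tweepy's error class is named differently in 3.x and 4.x,
                # so this sketch catches broadly and records the failure.
                failed.append((tweet_id, str(err)))
                continue
            json.dump(status._json, outfile)
            outfile.write("\n")
    return failed


def read_tweet_counts(path="tweet_json.txt"):
    """Read the file back and keep only the fields needed for the analysis."""
    rows = []
    with open(path, encoding="utf-8") as infile:
        for line in infile:
            tweet = json.loads(line)
            rows.append({
                "tweet_id": tweet["id"],
                "retweet_count": tweet["retweet_count"],
                "favorite_count": tweet["favorite_count"],
            })
    return rows

# Hypothetical usage with the objects defined in this report:
# failed = fetch_tweet_json(api, df_twitter_archive.tweet_id)
# df_api = pd.DataFrame(read_tweet_counts())
```

Writing one JSON object per line keeps the raw API responses on disk, so the dataframe can be rebuilt later without querying Twitter again.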

### Assessing data

Assessing data helps build a list of quality and tidiness issues to be fixed to enable a reliable analysis later on. I used both visual assessment (opening the csv files in MS Excel) and programmatic assessment (using different methods in the pandas library). Some examples of the visual and programmatic assessments performed are as follows.
- Programmatic assessment: Since there were three different data sources, the dataframes derived from them, namely 'df_twitter_archive', 'df_image_predications' and 'df_api', had different numbers of entries (2356, 2075 and 2342 respectively), which was a data quality issue. Obtained using the .info() method on each dataframe.<br><br>
- Visual and Programmatic: Dog stage was represented in 4 different columns, namely 'doggo', 'floofer', 'pupper' and 'puppo', in 'df_twitter_archive' instead of 1 column of categorical type. This was a tidiness issue as it dealt with the structure of the data. Obtained using the .head() method on the dataframe.<br><br>
- Visual and Programmatic: Dataframes 'df_image_predications' and 'df_api' simply contain further variables (image predictions, retweet count, favorite count etc.) about the tweets that are part of the dataframe 'df_twitter_archive' and hence should be merged with it. This was again a tidiness issue as it was structural in nature. Obtained using the .info() and .head() methods on the respective dataframes.<br><br>
- Programmatic assessment: There were non-null values in the columns 'in_reply_to_status_id', 'in_reply_to_user_id', 'retweeted_status_id' and 'retweeted_status_timestamp' of 'df_twitter_archive', which indicated that these tweets were replies or retweets rather than original tweets. Such tweets were not required for the analysis. This was a data quality issue. Obtained using the .isnull() and .unique() methods on the dataframe.<br><br>
- Programmatic assessment: Some values in the 'name' column of dataframe 'df_twitter_archive' were in lower case; these were not valid names but random words. This was a data quality issue. Another issue was 'None' values in the 'name' column even when a valid dog name was present in the 'text' column. Obtained using the .str.islower(), .unique() and .value_counts() methods on the column.<br><br>
- Programmatic assessment: Certain outliers, such as the value 0, were found in the columns 'rating_numerator' and 'rating_denominator' of the dataframe 'df_twitter_archive'. We also cross-checked these values by extracting the actual numerator and denominator from the 'text' column and found other issues, such as decimal values not being captured accurately and wrong values being picked up where other digits were present in the text. These were all data quality issues. Obtained using the .unique() and .value_counts() methods on the dataframe.<br><br>
- Visual and Programmatic: Dog breeds in columns 'p1', 'p2' and 'p3' of 'df_image_predications' started with lower case characters and contained underscore characters. Again, these were data quality issues. Obtained using the .head() method on the dataframe.<br><br>
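
The programmatic checks listed above can be sketched on a small stand-in dataframe. The values below are made up for illustration and the variable names are assumptions; only the pandas methods mirror the ones named in the list.

```python
import pandas as pd

# Toy stand-in for df_twitter_archive; the values are invented for illustration
archive = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "name": ["Bella", "a", "None", "such"],
    "retweeted_status_id": [None, 9.0, None, None],
    "rating_denominator": [10, 10, 0, 10],
})

# Entry counts, dtypes and non-null counts per column
archive.info()

# Non-null values here mark retweets rather than original tweets
n_retweets = archive["retweeted_status_id"].notnull().sum()

# Lower-case 'names' are ordinary words picked up by mistake
lowercase_names = archive.loc[archive["name"].str.islower(), "name"].tolist()

# Frequency table that exposes outliers such as a denominator of 0
denominator_counts = archive["rating_denominator"].value_counts()
```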

The final list of issues found was divided into two main types:
1. Quality issues: Caused by dirty data, i.e. data that has problems with its content. Common data quality issues include missing data, invalid data, inaccurate data, and inconsistent data.<br><br>
2. Tidiness issues: Caused by the structure of the data, also referred to as messy data. Common tidiness issues include multiple columns holding the values of one variable, and multiple tables for one observational unit, which then require merging.

### Cleaning data

Cleaning data is the final step in the data wrangling process, where we fix the quality and tidiness issues identified in the previous stage of assessing data. We had to revisit the "Assessing data" section multiple times, as we usually didn't find all the quality and tidiness issues in one go. So there was a constant back and forth between the stages of the data wrangling process.

We followed the following structure while cleaning the data:

- Define - Providing detail on how to fix the data quality or tidiness issue.
- Code - Python code to fix the issue.
- Test - Python code to check whether the issue has been fixed.
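
As a minimal sketch of this Define, Code, Test pattern, the example below applies it to the categorical conversion of 'dog_stage' on a toy dataframe. The dataframe and its values are made up for illustration; the asserts show one possible form a test cell could take.

```python
import pandas as pd

# Define: 'dog_stage' holds a small set of labels, so store it as a category.
df = pd.DataFrame({"dog_stage": ["Pupper", "None", "Doggo", "Pupper"]})

# Code: convert the column.
df["dog_stage"] = df["dog_stage"].astype("category")

# Test: the checks fail loudly if the fix did not take effect.
assert isinstance(df["dog_stage"].dtype, pd.CategoricalDtype)
assert set(df["dog_stage"].cat.categories) == {"Doggo", "None", "Pupper"}
```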

Some major steps performed under the process to clean data were as follows.

1. We started by making copies of all the three dataframes namely 'df_twitter_archive', 'df_image_predications' and 'df_api' using the .copy() method to prevent changing the original datasets and preserve the original data in case we need it later.<br><br>
```
    df_twitter_archive_clean = df_twitter_archive.copy()
    df_image_predications_clean = df_image_predications.copy()
    df_api_clean = df_api.copy()
```
<br><br>

2. Found the common set of tweets in the three dataframes 'df_twitter_archive', 'df_image_predications' and 'df_api' and dropped the rest to keep our data consistent. The following steps were followed.
    - Using the built-in set() function, got the set of unique tweet_ids from the dataframe with the fewest observations.
    - Using the .isin() method, kept only the tweets from the second smallest dataframe that were also in the set from the step above.
    - Rebuilt the set from that filtered dataframe and, again using the .isin() method, kept only the matching tweets in the remaining dataframes.
<br><br>
Code snippet provided below.


```
common_tweets = set(df_image_predications_clean.tweet_id)
df_api_clean = df_api_clean[df_api_clean.tweet_id.isin(common_tweets)]

common_tweets = set(df_api_clean.tweet_id)
df_twitter_archive_clean = df_twitter_archive_clean[df_twitter_archive_clean.tweet_id.isin(common_tweets)]
df_image_predications_clean = df_image_predications_clean[df_image_predications_clean.tweet_id.isin(common_tweets)] 
```
<br><br>
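
An alternative sketch computes the intersection of all tweet IDs in one step and filters every dataframe against it. The helper name `keep_common_tweets` and the toy dataframes are assumptions for illustration; this is a variation on the steps above, not the code used in the project.

```python
from functools import reduce

import pandas as pd


def keep_common_tweets(*frames, key="tweet_id"):
    """Return copies of the frames restricted to the IDs present in all of them."""
    common = reduce(set.intersection, (set(frame[key]) for frame in frames))
    return [frame[frame[key].isin(common)].copy() for frame in frames]


# Toy dataframes with made-up IDs
archive = pd.DataFrame({"tweet_id": [1, 2, 3, 4]})
images = pd.DataFrame({"tweet_id": [2, 3, 5]})
api_data = pd.DataFrame({"tweet_id": [1, 2, 3]})

archive, images, api_data = keep_common_tweets(archive, images, api_data)
```

Because the intersection is symmetric, the result does not depend on which dataframe is the smallest.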

3. Added a new column 'dog_stage' populated with appropriate value from 'text' column. Removed individual dog stage columns namely 'doggo', 'floofer', 'pupper' and 'puppo' in 'df_twitter_archive'. The following steps were followed.
    - Appended the values of all dog stages namely 'doggo', 'floofer', 'pupper' and 'puppo' as the value for new column 'dog_stage'.
```
df_twitter_archive_clean['dog_stage'] = df_twitter_archive_clean.doggo + df_twitter_archive_clean.floofer + \
df_twitter_archive_clean.pupper + df_twitter_archive_clean.puppo
```
    - Used .value_counts() method to obtain all the unique values for the new column 'dog_stage'.
```
    df_twitter_archive_clean.dog_stage.value_counts()

    output:
                    1745
    pupper          211 
    doggo           67  
    puppo           23  
    doggopupper     11  
    floofer         7   
    doggopuppo      1   
    doggofloofer    1   
    Name: dog_stage, dtype: int64
```
    - Renamed all the instances of dog stages namely 'doggo', 'floofer', 'pupper', 'puppo' and multiple values such as 'doggopupper', 'doggopuppo', 'doggofloofer' appropriately.
```
    # Map every raw value to its final label in one pass and assign the result back
    # (in-place replace on a selected column may not update the dataframe in newer pandas)
    stage_labels = {'': 'None', 'pupper': 'Pupper', 'doggo': 'Doggo', 'puppo': 'Puppo', 'floofer': 'Floofer',
                    'doggopupper': 'Multiple', 'doggopuppo': 'Multiple', 'doggofloofer': 'Multiple'}
    df_twitter_archive_clean['dog_stage'] = df_twitter_archive_clean['dog_stage'].replace(stage_labels)
```
    - Deleted the individual dog stage columns.
```
    # Drop all four stage columns in a single call
    df_twitter_archive_clean = df_twitter_archive_clean.drop(['doggo', 'floofer', 'pupper', 'puppo'], axis=1)
```

<br><br>
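
The whole of step 3 can also be sketched as a single row-wise helper. This is an illustration under one assumption that the report does not state: that a stage column holds its own name (for example 'doggo') when the stage applies and some other placeholder such as 'None' otherwise. The helper name `combine_stages` and the toy data are hypothetical.

```python
import pandas as pd

STAGES = ["doggo", "floofer", "pupper", "puppo"]


def combine_stages(row):
    """Collapse the four stage columns of one row into a single label."""
    present = [stage for stage in STAGES if row[stage] == stage]
    if not present:
        return "None"
    if len(present) > 1:
        return "Multiple"
    return present[0].title()


# Toy data; 'None' as the placeholder for a missing stage is an assumption
tweets = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "doggo": ["doggo", "None", "None", "doggo"],
    "floofer": ["None", "None", "None", "None"],
    "pupper": ["None", "pupper", "None", "pupper"],
    "puppo": ["None", "None", "None", "None"],
})

tweets["dog_stage"] = tweets[STAGES].apply(combine_stages, axis=1)
tweets = tweets.drop(columns=STAGES)
```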

4. Converted column 'dog_stage' to categorical type.<br><br>
```
    df_twitter_archive_clean.dog_stage = df_twitter_archive_clean.dog_stage.astype('category')
```
<br><br>

5. As the 'df_image_predications' and 'df_api' tables were not separate observational units, merged them with 'df_twitter_archive'.

```
    df_master_clean = pd.merge(df_twitter_archive_clean, df_image_predications_clean, on='tweet_id', how='left')
    df_master_clean = pd.merge(df_master_clean, df_api_clean, on='tweet_id', how='left')
```
<br><br>
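
When merging on 'tweet_id', it can help to let pandas confirm that the key is unique on both sides and to see which rows found no match. The sketch below uses the `validate` and `indicator` parameters of pd.merge() on toy dataframes; the data and variable names are made up for illustration.

```python
import pandas as pd

# Toy dataframes with made-up values
archive = pd.DataFrame({"tweet_id": [1, 2, 3], "text": ["a dog", "a pup", "a floof"]})
images = pd.DataFrame({"tweet_id": [1, 2], "p1": ["pug", "beagle"]})

merged = pd.merge(
    archive,
    images,
    on="tweet_id",
    how="left",
    validate="one_to_one",  # raises MergeError if tweet_id repeats on either side
    indicator=True,         # adds a '_merge' column describing each row's origin
)

# Rows from the archive without a matching image prediction
unmatched = merged[merged["_merge"] == "left_only"]
```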

6. Converted column 'timestamp' from a string object to a datetime object to make plotting on a time scale easier.
```
    df_master_clean['timestamp'] = pd.to_datetime(df_master_clean['timestamp'])
```
<br><br>
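
As a small illustration of what the conversion enables, the sketch below parses made-up timestamp strings and then counts tweets per year through the `.dt` accessor, which is only available on datetime columns.

```python
import pandas as pd

# Made-up timestamps standing in for the 'timestamp' column
df = pd.DataFrame({
    "timestamp": ["2015-11-15 22:32:08", "2016-06-01 01:00:00", "2016-06-20 12:00:00"],
})

df["timestamp"] = pd.to_datetime(df["timestamp"])

# Date parts are now available for grouping and plotting
tweets_per_year = df["timestamp"].dt.year.value_counts().sort_index()
```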

7. Removed entries which were retweets and replies.
    - To find the entries which are retweets, we used the column 'retweeted_status_id' together with the .notnull() method as a boolean mask.
```
    df_master_clean.drop(df_master_clean[df_master_clean.retweeted_status_id.notnull()].index, inplace=True)
```
    - To find the entries which are replies to other tweets, we used the column 'in_reply_to_status_id' together with the .notnull() method as a boolean mask.
```
    df_master_clean.drop(df_master_clean[df_master_clean.in_reply_to_status_id.notnull()].index, inplace=True)
```
<br><br>
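
The same filtering can be sketched as a single boolean mask that keeps only original tweets. The toy dataframe and the names `is_original` and `originals` are assumptions for illustration.

```python
import pandas as pd

# Toy data: tweet 2 is a retweet, tweet 3 is a reply
df = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "retweeted_status_id": [None, 11.0, None, None],
    "in_reply_to_status_id": [None, None, 22.0, None],
})

# A tweet is original when both reference columns are empty
is_original = df["retweeted_status_id"].isnull() & df["in_reply_to_status_id"].isnull()
originals = df[is_original].reset_index(drop=True)
```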

8. After deleting the entries which are retweets or replies, the columns 'in_reply_to_status_id', 'in_reply_to_user_id', 'source', 'retweeted_status_id', 'retweeted_status_user_id', 'retweeted_status_timestamp' and 'expanded_urls' did not provide much value for this analysis, so we dropped them.
```
    df_master_clean.drop(['in_reply_to_status_id', 'in_reply_to_user_id', 'source', 'retweeted_status_id', 'retweeted_status_user_id', 'retweeted_status_timestamp', 'expanded_urls'], axis=1, inplace=True)
```
<br><br>

9. Updated the 'name' values by extracting names from the 'text' column where the text contained strings such as 'named' or 'name is' followed by a valid dog name. Also updated entries whose 'name' was set to 'None' although the 'text' column contained a valid dog name. The following steps were taken.
    - Got a view of all entries whose 'name' value was in lower case by using the .str.islower() method on the 'name' column as a boolean mask. We got 98 such entries.
```
    df_master_clean[df_master_clean.name.str.islower()]
```
    - Updated the 'name' values of all such entries where the 'text' column contained the strings 'named' or 'name is' by extracting the name from the text using regular expressions. Entries without either string were set to 'None'.
```
    import re

    # Rows whose 'name' is a lower-case word rather than a real dog name
    lowercase_mask = df_master_clean.name.str.islower()
    for text in df_master_clean.loc[lowercase_mask, 'text']:
        # Prefer "named X"; fall back to "name is X"
        match = re.search(r"named\s(\w+)", text) or re.search(r"name is\s(\w+)", text)
        mask = df_master_clean.text == text
        # No usable name in the text: fall back to the 'None' placeholder
        df_master_clean.loc[mask, 'name'] = match.group(1) if match else "None"
```
<br><br>
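
The name fix can also be written without an explicit loop by using `.str.extract()`. The sketch below is an alternative under the same idea, shown on made-up tweets; the combined pattern and the variable names are assumptions for illustration.

```python
import pandas as pd

# Made-up tweets; the first three carry a lower-case word in 'name'
df = pd.DataFrame({
    "text": [
        "This is a dog named Wylie. 12/10",
        "Her name is Zoey and she is lovely",
        "This is such a good boy",
        "Meet Bella. 13/10",
    ],
    "name": ["a", "my", "such", "Bella"],
})

lowercase = df["name"].str.islower()

# One pattern covers both 'named X' and 'name is X'; rows without a match yield NaN
extracted = df.loc[lowercase, "text"].str.extract(
    r"(?:named|name is)\s(\w+)", expand=False
)

# Rows where no name could be extracted fall back to the 'None' placeholder
df.loc[lowercase, "name"] = extracted.fillna("None")
```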

10. Corrected the values of columns 'rating_numerator' and 'rating_denominator' by extracting the appropriate values from the 'text' column. Added a 'rating' column calculated as 'rating_numerator' divided by 'rating_denominator'. Other outliers were also corrected. The following steps were taken.
    - Extracted the rating values from the text using a regular expression and added them to a new column 'rating'.
```
    # Raw string so the backslashes reach the regex engine unchanged
    df_master_clean['rating'] = df_master_clean.text.str.extract(r'(\d+\.?\d*\/\d+)', expand=False)
```
    - Viewed the unique values of the newly created 'rating' column using .value_counts() method.
    - Upon visual inspection found out that rating '420/10' of tweet_id=670842764863651840 was a meme rating for the rap artist Snoop Dogg and hence deleted the entry.
    - The value 24/7 for 'rating' in tweet_id=810984652412424192 was not a dog rating; 24/7 referred to a period of time, hence deleted this entry.
    - Updated the 'rating' value for tweet_id=716439118184652801 from incorrect value 50/50 to the correct value 11/10.
    - Viewed entries with more than one fractional values in column 'text' using regular expression.
```
    # Raw string with a non-capturing group, since .str.contains() only needs a yes/no match
    df_master_clean[df_master_clean.text.str.contains(r'\d+\.?\d*\/\d+\.?\d*\D+(?:\d+\.?\d*\/\d+\.?\d*)')]
```
        - Some tweets feature more than one dog and therefore more than one rating. Corrected the rating values for such entries individually.
```
    df_master_clean.loc[df_master_clean.text == no_ratings_text1, 'rating'] = '9/10'
    df_master_clean.loc[df_master_clean.text == no_ratings_text2, 'rating'] = '8/10'
    df_master_clean.loc[df_master_clean.text == no_ratings_text3, 'rating'] = '8/10'
    df_master_clean.loc[df_master_clean.text == no_ratings_text4, 'rating'] = '7/10'
    df_master_clean.loc[df_master_clean.text == no_ratings_text5, 'rating'] = '9/10'
```
        - Updated the columns 'rating_numerator' and 'rating_denominator' with the corrected values and recalculated the float values for column 'rating'.
```
    df_master_clean['rating_numerator'] = df_master_clean['rating'].apply(lambda x: x.split('/')[0])
    df_master_clean['rating_denominator'] = df_master_clean['rating'].apply(lambda x: x.split('/')[1])
    df_master_clean['rating_numerator'] = df_master_clean.rating_numerator.astype(float)
    df_master_clean['rating_denominator'] = df_master_clean.rating_denominator.astype(float)
    df_master_clean['rating'] = df_master_clean['rating_numerator'] / df_master_clean['rating_denominator']
```
<br><br>
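
The rating extraction can be sketched end to end with named regex groups, which yield the numerator and denominator as separate columns in one call. The tweets below are made up, and the pattern is a simplified assumption that accepts decimal numerators; it is not the exact expression used in this project.

```python
import pandas as pd

# Made-up tweets covering a decimal rating, a plain rating, two ratings and none
tweets = pd.DataFrame({
    "text": [
        "This is Bella. 13.5/10 would pet",
        "Meet Sam. 12/10",
        "Two dogs here. 10/10 and 7/10",
        "No rating in this one",
    ],
})

pattern = r"(?P<rating_numerator>\d+(?:\.\d+)?)/(?P<rating_denominator>\d+)"

# The first match per tweet becomes the rating; rows without a match stay NaN
parts = tweets["text"].str.extract(pattern).astype(float)
tweets = tweets.join(parts)
tweets["rating"] = tweets["rating_numerator"] / tweets["rating_denominator"]

# Tweets with more than one fraction still need a manual look
needs_review = tweets["text"].str.count(pattern) > 1
```

Flagging multi-rating tweets separately keeps the manual corrections limited to the few rows where the first match may be the wrong one.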

11. Capitalized the dog breeds that started with lower case characters in columns 'p1', 'p2' and 'p3' (originally from 'df_image_predications') and replaced the underscores with space characters.
```
    # Capitalize each word of the predicted breed, then swap underscores for spaces
    for col in ['p1', 'p2', 'p3']:
        df_master_clean[col] = df_master_clean[col].str.title().str.replace('_', ' ')
```

### Storing the cleaned data for applying analytics and visualizations

After assessing and cleaning the dataset, we store the clean data so that analytics and visualizations can be applied to it. Code snippet provided below.

```
    df_master_clean.to_csv('twitter_archive_master.csv', index=False, encoding='utf-8')
    df_master = pd.read_csv('twitter_archive_master.csv', encoding='utf-8')
```
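
One thing to keep in mind is that a CSV file stores only text, so column types such as datetime and category are not restored automatically when the file is read back. The sketch below shows one way to request them again on load, using the `parse_dates` and `dtype` parameters of pd.read_csv(). The toy dataframe and the temporary file location are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

# Toy cleaned dataframe with a datetime and a categorical column
df_clean = pd.DataFrame({
    "tweet_id": [1, 2],
    "timestamp": pd.to_datetime(["2016-01-01 10:00:00", "2017-03-05 18:30:00"]),
    "dog_stage": pd.Categorical(["Pupper", "Doggo"]),
})

path = os.path.join(tempfile.mkdtemp(), "twitter_archive_master.csv")
df_clean.to_csv(path, index=False, encoding="utf-8")

# Plain read: 'timestamp' and 'dog_stage' come back as text columns
df_plain = pd.read_csv(path, encoding="utf-8")

# Typed read: ask pandas to rebuild the types while loading
df_typed = pd.read_csv(
    path,
    encoding="utf-8",
    parse_dates=["timestamp"],
    dtype={"dog_stage": "category"},
)
```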

### Conclusion

The main realization from this data wrangling project was that most of the world's data is dirty or messy rather than clean. Hence data wrangling is a fundamental operation performed by data analysts and data scientists; in fact, a major portion of their time is spent cleaning data and making it suitable for analysis or for machine learning algorithms. Wrangling also deepens the understanding of the data set, which helps in planning the analysis and in streamlining the cleaning process itself. It cannot be skipped before applying machine learning algorithms or analytics, because doing so leads to inaccurate and unreliable predictions. Your predictions or visualizations can only be as good as your data.