## Reporting: wrangle report

### Objectives

Data wrangling consists of gathering, assessing and cleaning data. These three stages form the objectives of the wrangling process for this project:
* **Gathering** - Collecting and loading data from various endpoints
* **Assessing** - Inspecting the data visually and programmatically for quality and tidiness issues
* **Cleaning** - Correcting the issues found during assessment

This process is also iterative.

### Step 1: Gather Data

In this step, data were gathered from three sources, each using a different method:

1. `twitter_archive_enhanced.csv`, which was downloaded manually and loaded into the notebook.
2. `image_predictions.tsv`, which was loaded into the notebook programmatically.
3. Retweet and favorite counts, which were retrieved from the Twitter API through the Tweepy library, saved to `tweet_json.txt` and then loaded into the notebook. A script used the tweet IDs from the archive data to request the required data for each tweet.
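
As a rough illustration of how the third source could be loaded, the sketch below parses line-delimited tweet JSON into records. It assumes that `tweet_json.txt` holds one JSON object per line and that each object carries the `id`, `retweet_count` and `favorite_count` fields. The function name `parse_tweet_json` and the sample tweets are made up for illustration; this is not the project's actual script.

```python
import io
import json


def parse_tweet_json(lines):
    """Turn line-delimited tweet JSON into a list of small dicts."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:  # skip blank lines
            continue
        tweet = json.loads(line)
        records.append({
            "tweet_id": tweet["id"],
            "retweet_count": tweet["retweet_count"],
            "favorite_count": tweet["favorite_count"],
        })
    return records


# Two made-up tweets standing in for the contents of tweet_json.txt
sample = io.StringIO(
    '{"id": 1, "retweet_count": 10, "favorite_count": 25}\n'
    '{"id": 2, "retweet_count": 3, "favorite_count": 7}\n'
)
records = parse_tweet_json(sample)
print(records)
```

With a real file, the same function could be fed an open file handle, for example `with open("tweet_json.txt") as f: records = parse_tweet_json(f)`.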

### Steps 2 and 3: Assessing and Cleaning Data

Assessment looks for two kinds of issues, quality and tidiness, and is carried out both visually and programmatically. The table below lists each issue observed and the fix applied.
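
A minimal sketch of what a programmatic check might look like, using pandas on a few made-up rows. The data frames and their values are hypothetical and only mimic the shape of the archive and image prediction data; the variable names are chosen for illustration.

```python
import pandas as pd

# Made-up rows that mimic the archive and image prediction data
archive = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "timestamp": [
        "2017-08-01 16:23:56 +0000",
        "2017-08-01 00:17:27 +0000",
        "2017-07-31 00:18:03 +0000",
    ],
    "doggo": ["None", "doggo", "None"],
})
predictions = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "jpg_url": ["a.jpg", "b.jpg", "a.jpg"],
})

# Is the timestamp column already a datetime type?
timestamp_is_datetime = pd.api.types.is_datetime64_any_dtype(archive["timestamp"])

# How many rows use the placeholder string "None" for a missing stage?
none_strings = int((archive["doggo"] == "None").sum())

# How many rows repeat an image URL that appeared earlier?
duplicate_urls = int(predictions["jpg_url"].duplicated().sum())

print(timestamp_is_datetime, none_strings, duplicate_urls)
```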

<table>
    <tr><th colspan="3"><h5>Quality Issues</h5></th></tr>
    <tr>
        <th>Dataset</th>
        <th>Observation</th>
        <th>Solution</th>
    </tr>
    <tr>
        <td rowspan="8">Archive data</td>
        <td>Data contains replies and retweets in addition to original tweets</td>
        <td>Removed retweeted and reply tweets and kept original tweets only</td>
    </tr>
    <tr>
        <td>The columns <code>doggo</code>, <code>floofer</code>, <code>pupper</code> and <code>puppo</code> use <code>None</code> to represent missing values</td>
        <td>Changed <code>None</code> values to <code>np.nan</code></td>
    </tr>
    <tr>
        <td><code>timestamp</code> is stored as object data type instead of datetime</td>
        <td>Changed the data type from object to datetime</td>
    </tr>
    <tr>
        <td><code>text</code> values contain links</td>
        <td>Removed the links</td>
    </tr>
    <tr>
        <td>The <code>rating_numerator</code> column has incorrect values and data types</td>
        <td>Re-extracted the numerator values from the <code>text</code> column</td>
    </tr>
    <tr>
        <td>The <code>rating_denominator</code> column has incorrect values</td>
        <td>Re-extracted the denominator values from the <code>text</code> column</td>
    </tr>
    <tr>
        <td>The <code>source</code> column values are enclosed in <code>&lt;a&gt;</code> tags</td>
        <td>Extracted the source values from the <code>&lt;a&gt;</code> tags in the <code>source</code> column</td>
    </tr>
    <tr>
        <td>Some columns are not necessary for analysis</td>
        <td>Removed the unnecessary columns</td>
    </tr>
    <tr>
        <td>Image predictions data</td>
        <td>There are duplicated image URLs in <code>jpg_url</code></td>
        <td>Removed the rows with duplicated image URLs</td>
    </tr>
    <tr><th colspan="3"><h5>Tidiness Issues</h5></th></tr>
    <tr>
        <th>Dataset</th>
        <th>Observation</th>
        <th>Solution</th>
    </tr>
    <tr>
        <td>Archive data</td>
        <td>The columns <code>doggo</code>, <code>floofer</code>, <code>pupper</code> and <code>puppo</code> should be a single column, i.e. <code>dog_stage</code></td>
        <td>Melted the four columns into one column</td>
    </tr>
    <tr>
        <td>Image predictions data</td>
        <td>The columns <code>p1</code>, <code>p1_conf</code>, <code>p1_dog</code>, <code>p2</code>, <code>p2_conf</code>, <code>p2_dog</code>, <code>p3</code>, <code>p3_conf</code> and <code>p3_dog</code> should be reduced to two columns, i.e. <code>breed</code> and <code>conf</code></td>
        <td>Kept the highest-confidence prediction flagged <code>True</code> as a dog, together with its breed, in the new columns and dropped the original columns</td>
    </tr>
    <tr>
        <td>General</td>
        <td>All datasets should be combined into one dataset</td>
        <td>Merged all datasets into one dataset on the tweet IDs</td>
    </tr>
</table>
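
Several of the archive fixes in the table can be sketched with pandas as follows. This is a minimal illustration on two made-up tweets, not the notebook's actual code. The regular expressions, the intermediate names (`stages`, `stage_long`, `ratings`) and the sample rows are assumptions, and the sketch assumes the missing stages are stored as the string `"None"`.

```python
import pandas as pd

# Two made-up tweets that mimic the archive columns named in the table
archive = pd.DataFrame({
    "tweet_id": [1, 2],
    "timestamp": ["2017-08-01 16:23:56 +0000", "2017-07-31 00:18:03 +0000"],
    "source": [
        '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
        '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>',
    ],
    "text": ["This is Bella. 13.5/10 https://t.co/abc", "Meet Rex. 12/10 https://t.co/xyz"],
    "doggo": ["None", "doggo"],
    "floofer": ["None", "None"],
    "pupper": ["None", "None"],
    "puppo": ["None", "None"],
})
stages = ["doggo", "floofer", "pupper", "puppo"]

# Quality: turn the "None" placeholder strings into real missing values (NaN)
archive[stages] = archive[stages].mask(archive[stages] == "None")

# Quality: convert timestamp strings to datetime
archive["timestamp"] = pd.to_datetime(archive["timestamp"])

# Quality: strip links from the tweet text
archive["text"] = archive["text"].str.replace(r"https?://\S+", "", regex=True).str.strip()

# Quality: re-extract the rating from the text, allowing decimal numerators
ratings = archive["text"].str.extract(r"(?P<numerator>\d+(?:\.\d+)?)/(?P<denominator>\d+)")
archive["rating_numerator"] = pd.to_numeric(ratings["numerator"])
archive["rating_denominator"] = pd.to_numeric(ratings["denominator"])

# Quality: keep only the label between the <a> tags
archive["source"] = archive["source"].str.extract(r">([^<]+)<", expand=False)

# Tidiness: melt the four stage columns into one dog_stage column
stage_long = (
    archive.melt(id_vars=["tweet_id"], value_vars=stages, value_name="dog_stage")
    .dropna(subset=["dog_stage"])[["tweet_id", "dog_stage"]]
)
archive = archive.drop(columns=stages).merge(stage_long, on="tweet_id", how="left")

print(archive)
```

Note that a tweet tagged with more than one stage would appear on several rows after this merge, so real data may need an extra step to combine or deduplicate such cases.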

### Results

The result was a final dataset that merged the cleaned data from the three sources. This dataset was saved to a CSV file called `twitter_archive_master.csv`.
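
The last tidiness steps and the final save could look roughly like the sketch below. It reflects one reading of the prediction fix: take the first of `p1`, `p2`, `p3` that is flagged as a dog, assuming the three predictions are ranked by confidence. The helper `best_dog_prediction`, the frame names and all sample values are hypothetical. The sketch writes to an in-memory buffer; passing `"twitter_archive_master.csv"` to `to_csv` instead would write the file.

```python
import io

import numpy as np
import pandas as pd

# Made-up, already-cleaned inputs
archive = pd.DataFrame({"tweet_id": [1, 2], "rating_numerator": [13.5, 12.0]})
counts = pd.DataFrame({"tweet_id": [1, 2], "retweet_count": [10, 3], "favorite_count": [25, 7]})
predictions = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "jpg_url": ["a.jpg", "b.jpg", "a.jpg"],
    "p1": ["golden_retriever", "seat_belt", "pug"],
    "p1_conf": [0.8, 0.6, 0.9],
    "p1_dog": [True, False, True],
    "p2": ["labrador_retriever", "pug", "bull_mastiff"],
    "p2_conf": [0.1, 0.3, 0.05],
    "p2_dog": [True, True, True],
    "p3": ["kuvasz", "bath_towel", "boxer"],
    "p3_conf": [0.05, 0.05, 0.01],
    "p3_dog": [True, False, True],
})

# Quality: drop rows that repeat an image URL (the first occurrence is kept)
predictions = predictions.drop_duplicates(subset="jpg_url")


def best_dog_prediction(row):
    """Return the first prediction flagged as a dog (assumes p1..p3 are ranked by confidence)."""
    for n in (1, 2, 3):
        if row[f"p{n}_dog"]:
            return pd.Series({"breed": row[f"p{n}"], "conf": row[f"p{n}_conf"]})
    return pd.Series({"breed": np.nan, "conf": np.nan})


# Tidiness: replace the nine prediction columns with breed and conf
best = predictions.apply(best_dog_prediction, axis=1)
predictions = pd.concat([predictions[["tweet_id", "jpg_url"]], best], axis=1)

# Tidiness: merge the three datasets on the tweet ID
master = archive.merge(counts, on="tweet_id").merge(predictions, on="tweet_id")

# Save the result as CSV (in memory here)
buffer = io.StringIO()
master.to_csv(buffer, index=False)
print(buffer.getvalue())
```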