# Wrangle and Analyze data

# Table of Contents

* [Introduction](#introduction)
* [Data Wrangling](#data-wrangling)
    * [Gathering](#gathering-the-data)
    * [Assessing](#assessing-the-data)
    * [Cleaning](#cleaning)
* [Storing](#storing)
* [Exploratory Data Analysis](#exploratory-data-analysis)
* [Conclusion](#conclusion)
* [References](#references)




## Introduction

> Real-world data rarely comes clean. Using Python and its libraries, I gathered data from a variety of sources and in a variety of formats, assessed its quality and tidiness, and then cleaned it.

### Software needed
> The following packages (libraries) need to be installed. You can install the third-party packages via conda or pip; `json` is part of the Python standard library and needs no installation. An import statement only loads a package that is already installed.
* pandas
* NumPy
* matplotlib
* seaborn
* requests
* tweepy
* json

> You can use a text editor, like VS Code or Atom, but I preferred to keep notes in Jupyter.

> A terminal application (Terminal on Mac and Linux, or Cygwin on Windows).
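
Before running the notebook, it can help to confirm that the third-party packages are actually installed. The snippet below is a minimal sketch of one way to do that; the helper name `missing_packages` is hypothetical and not part of the original notebook.

```python
import importlib.util

# Third-party packages used in this notebook; json ships with Python itself.
REQUIRED = ["pandas", "numpy", "matplotlib", "seaborn", "requests", "tweepy"]


def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Install these first (conda or pip):", ", ".join(missing))
else:
    print("All required packages are available")
```

Anything reported as missing can then be installed with conda or pip from the terminal.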

### Background

> I gathered three pieces of data, described in detail throughout this Jupyter Notebook, `wrangle_act.ipynb`. The WeRateDogs Twitter archive was provided to me by Udacity as part of this project. I downloaded the file, `twitter-archive-enhanced.csv`, manually.

> The tweet image predictions, i.e., what breed of dog (or other object, animal, etc.) is present in each tweet according to a neural network, are stored in the file `image-predictions.tsv`. This file was also provided by Udacity and was downloaded from their servers programmatically using the Requests library and the following URL: https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv

> I used the tweet IDs in the WeRateDogs Twitter archive to query the Twitter API for each tweet's JSON data using Python's Tweepy library, and stored each tweet's entire set of JSON data in a file called `tweet_json.txt`.


> Ultimately, my aim in undertaking this project was to wrangle this Twitter data in order to create interesting and trustworthy analyses and visualizations. This included making an extra effort to gather more data in order to get the best analysis and visualizations.

### Let's get started!
> I started out by setting up the import statements for all of the packages I planned on using.

In [1]:
# import library for data manipulation and analysis
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sn
# import library for making HTTP requests
import requests
import json
# import os(operating system) library to store file
import os

# 'magic word' so that your visualizations are plotted
%matplotlib inline
print("Set up complete")

Set up complete


## Gathering the Data
Data was gathered from three sources:
1. A CSV file, downloaded manually and loaded into a DataFrame.
2. The tweet image predictions, downloaded programmatically and loaded into a DataFrame.
3. Additional tweet data, queried from the Twitter API.

### Firstly, download and load csv file data manually into DataFrame.
> The WeRateDogs Twitter archive was provided by Udacity. The file `twitter-archive-enhanced.csv` was downloaded manually and then loaded into the pandas DataFrame `arch_df`.

In [2]:
# load in the twitter archived data for WeRateDogs into dataframe
arch_df = pd.read_csv('twitter-archive-enhanced.csv')
arch_df.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,


In [3]:
# raw dataset summary that displays the non-null count and dtype of each column
arch_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   tweet_id                    2356 non-null   int64  
 1   in_reply_to_status_id       78 non-null     float64
 2   in_reply_to_user_id         78 non-null     float64
 3   timestamp                   2356 non-null   object 
 4   source                      2356 non-null   object 
 5   text                        2356 non-null   object 
 6   retweeted_status_id         181 non-null    float64
 7   retweeted_status_user_id    181 non-null    float64
 8   retweeted_status_timestamp  181 non-null    object 
 9   expanded_urls               2297 non-null   object 
 10  rating_numerator            2356 non-null   int64  
 11  rating_denominator          2356 non-null   int64  
 12  name                        2356 non-null   object 
 13  doggo                       2356 

### Secondly, download and load tweet image predictions programmatically into DataFrame.
> The file `image-predictions.tsv` is hosted on Udacity's servers and was downloaded programmatically using the `requests` library and the following URL: https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv

In [4]:
# The `requests` module allows you to send HTTP requests
import requests

# URL of the image prediction TSV file
url = "https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv"

# Send an HTTP GET request and save the server's response in `r`
r = requests.get(url)
# Stop here if the download failed instead of saving an error page
r.raise_for_status()

# Save the response body under the file name at the end of the URL
file_name = url.split('/')[-1]
with open(file_name, mode='wb') as file:
    file.write(r.content)

# Load the TSV file into the pandas DataFrame `img_df`
img_df = pd.read_csv(file_name, sep='\t')
# Display the first five rows
img_df.head()

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True


In [5]:
img_df.tail()

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
2070,891327558926688256,https://pbs.twimg.com/media/DF6hr6BUMAAzZgT.jpg,2,basset,0.555712,True,English_springer,0.22577,True,German_short-haired_pointer,0.175219,True
2071,891689557279858688,https://pbs.twimg.com/media/DF_q7IAWsAEuuN8.jpg,1,paper_towel,0.170278,False,Labrador_retriever,0.168086,True,spatula,0.040836,False
2072,891815181378084864,https://pbs.twimg.com/media/DGBdLU1WsAANxJ9.jpg,1,Chihuahua,0.716012,True,malamute,0.078253,True,kelpie,0.031379,True
2073,892177421306343426,https://pbs.twimg.com/media/DGGmoV4XsAAUL6n.jpg,1,Chihuahua,0.323581,True,Pekinese,0.090647,True,papillon,0.068957,True
2074,892420643555336193,https://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg,1,orange,0.097049,False,bagel,0.085851,False,banana,0.07611,False


Output: a table full of image predictions alongside each tweet ID, image URL, and the image number.

#### Column Descriptions for the tweet image prediction table
* tweet_id is the last part of the tweet URL after "status/" → https://twitter.com/dog_rates/status/889531135344209921

* p1 is the algorithm's #1 prediction for the image in the tweet → golden retriever
* p1_conf is how confident the algorithm is in its #1 prediction → 95%
* p1_dog is whether or not the #1 prediction is a breed of dog → TRUE
* p2 is the algorithm's second most likely prediction → Labrador retriever
* p2_conf is how confident the algorithm is in its #2 prediction → 1%
* p2_dog is whether or not the #2 prediction is a breed of dog → TRUE
etc.
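
The column descriptions above suggest a simple way to pick one breed per tweet: take the highest-ranked prediction that is flagged as a dog. The sketch below illustrates that idea on two rows copied from the `img_df.tail()` output; the helper `top_dog_breed` and the new `breed` column are hypothetical names, and this is only one possible rule, not necessarily the one used later in the project.

```python
import pandas as pd

# Two rows copied from the img_df.tail() output above
sample = pd.DataFrame({
    "tweet_id": [891689557279858688, 892420643555336193],
    "p1": ["paper_towel", "orange"],
    "p1_conf": [0.170278, 0.097049],
    "p1_dog": [False, False],
    "p2": ["Labrador_retriever", "bagel"],
    "p2_conf": [0.168086, 0.085851],
    "p2_dog": [True, False],
    "p3": ["spatula", "banana"],
    "p3_conf": [0.040836, 0.07611],
    "p3_dog": [False, False],
})


def top_dog_breed(row):
    """Return the highest-ranked prediction that is a dog breed, or None."""
    for rank in (1, 2, 3):
        if row[f"p{rank}_dog"]:
            return row[f"p{rank}"]
    return None


# `breed` is a hypothetical column name used only for this illustration
sample["breed"] = sample.apply(top_dog_breed, axis=1)
print(sample[["tweet_id", "breed"]])
```

For the first tweet the rule skips `paper_towel` and settles on the second prediction; the second tweet has no dog prediction at all, so its breed stays empty.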


### Thirdly, query the Twitter API using Tweepy
* Query Twitter API for each tweet's JSON data using the tweet IDs in the WeRateDogs Twitter archive.
* Query Twitter API for each tweet's "retweet count" and "favorite ("like") count" at minimum, and any additional data.
* Store each tweet's entire set of JSON data in a file called `tweet_json.txt`.
* Each tweet's JSON data should be written to its own line.
* Then read this .txt file line by line into a pandas DataFrame with (at minimum) tweet ID, retweet count, and favorite count. 
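
The "one JSON object per line" layout described in the steps above can be tried out without touching the Twitter API. The following is a minimal sketch using made-up dictionaries in place of real tweet data; the file name `tweet_json_demo.txt` and the contents of `fake_tweets` are assumptions for illustration only.

```python
import json
import os
import tempfile

# Hypothetical stand-ins for the dictionaries found in `tweet._json`
fake_tweets = [
    {"id": 1, "retweet_count": 10, "favorite_count": 25},
    {"id": 2, "retweet_count": 3, "favorite_count": 7},
]

# Write to a temporary folder so the real tweet_json.txt is left alone
path = os.path.join(tempfile.mkdtemp(), "tweet_json_demo.txt")

# Write one JSON object per line
with open(path, "w", encoding="utf-8") as file:
    for tweet in fake_tweets:
        file.write(json.dumps(tweet) + "\n")

# Read the file back line by line, skipping blank lines
with open(path, encoding="utf-8") as file:
    loaded = [json.loads(line) for line in file if line.strip()]

print(loaded == fake_tweets)
```

Because every line is a complete JSON document, the file can be processed one tweet at a time and a single damaged line does not make the rest unreadable.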

In [6]:
# used to download data from Twitter's API
import os
import tweepy
from tweepy import OAuthHandler
import json

# https://github.com/tweepy/tweepy/blob/master/examples/oauth.py
# == OAuth Authentication ==
#
# This mode of authentication is the new preferred way
# of authenticating with Twitter.

# Read the keys from environment variables rather than hardcoding them,
# so that the secrets never end up in the notebook.
# NOTE: the variable names below are assumptions; use whatever names you export.
consumer_key = os.environ['TWITTER_CONSUMER_KEY']
consumer_secret = os.environ['TWITTER_CONSUMER_SECRET']
access_token = os.environ['TWITTER_ACCESS_TOKEN']
access_secret = os.environ['TWITTER_ACCESS_SECRET']

# `OAuthHandler` instance into which we pass `consumer_key` and `consumer_secret`
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)

# set the access token and access secret on the handler
auth.set_access_token(access_token, access_secret)

# Construct the API instance
# `wait_on_rate_limit` makes Tweepy sleep until the rate limit replenishes;
# `wait_on_rate_limit_notify` (Tweepy 3.x) prints a notification while it waits
api = tweepy.API(auth, wait_on_rate_limit=True,
                 wait_on_rate_limit_notify=True)


In [7]:
# If the authentication was successful, you should
# see the name of the account print out
print(api.me().name)

Dain Russell


In [23]:
# create a list of tweet IDs from the archive
list_of_tweets = list(arch_df['tweet_id'])

* Error handling in Python is done through exceptions, which are raised inside `try` blocks and handled in `except` blocks.
* If an error is encountered, execution of the `try` block stops and control transfers to the matching `except` block.
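
The two points above can be seen in a small example that does not depend on the Twitter API. This is a minimal sketch; the function `safe_divide` is hypothetical and chosen only because ratings in this dataset are numerator/denominator pairs.

```python
def safe_divide(numerator, denominator):
    """Return the quotient, or None when the division is impossible."""
    try:
        result = numerator / denominator
        # This line is skipped if the division above raises an exception
        return result
    except ZeroDivisionError as error:
        # Control jumps here as soon as the error occurs in the try block
        print("Could not divide:", error)
        return None


print(safe_divide(13, 10))
print(safe_divide(13, 0))
```

Catching a specific exception type, rather than using a bare `except`, keeps unrelated problems (such as a keyboard interrupt) from being silently swallowed.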

In [26]:
import time

# Tweets that can't be found are saved in this list
not_found = []

# Wall-clock timer; `time.process_time()` would not count the time spent
# sleeping while waiting on the rate limit
start_time = time.time()

# Mode 'a' appends, so delete tweet_json.txt before re-running this cell
with open('tweet_json.txt', 'a', encoding='UTF-8') as file:
    # Write each available tweet's JSON to the file, one tweet per line
    for tweet_id in list_of_tweets:
        # `try` block: code that may raise an error
        try:
            tweet = api.get_status(tweet_id, tweet_mode='extended')

            # dump the tweet's JSON to tweet_json.txt
            json.dump(tweet._json, file)
            file.write('\n')
            print(tweet_id, "No error, writing to memory")
        # `except` block: handle API errors (e.g. deleted tweets) in Tweepy 3.x
        except tweepy.TweepError:
            not_found.append(tweet_id)
            print(tweet_id, "encountered an error, writing to memory")

time_elapsed = time.time() - start_time
print(time_elapsed)

892420643555336193 No error, writing to memory
892177421306343426 No error, writing to memory
891815181378084864 No error, writing to memory
891689557279858688 No error, writing to memory
891327558926688256 No error, writing to memory
...
888202515573088257 encountered an error, writing to memory
888078434458587136 No error, writing to memory
...
856602993587888130 encountered an error, writing to memory
856543823941562368 No error, writing to memory
...
829374341691346946 encountered an error, writing to memory
829141528400556032 No error, writing to memory
...
760656994973933572 No error, writing to memory
760641137271070720 No error, writing to memory

Rate limit reached. Sleeping for: 428

760539183865880579 No error, writing to memory
760521673607086080 No error, writing to memory
...
759566828574212096 encountered an error, writing to memory
759557299618865152 No error, writing to memory
...

Rate limit reached. Sleeping for: 590

677700003327029250 No error, writing to memory
677698403548192770 No error, writing to memory
...
666421158376562688 No error, writing to memory
666418789513326592 No error, writing to memory
...

* Each tweet's JSON data has now been written to its own line of `tweet_json.txt`.
* Next, this .txt file is read line by line into a pandas DataFrame with (at minimum) tweet ID, retweet count, and favorite count.

In [29]:
# https://knowledge.udacity.com/questions/304779
# https://stackoverflow.com/questions/12451431/loading-and-parsing-a-json-file-with-multiple-json-objects
# read the text file line by line to create a DataFrame from tweet_json.txt
tweets_data = []
with open('tweet_json.txt', encoding='UTF-8') as file:
    for line in file:
        try:
            tweet = json.loads(line)
            tweets_data.append(tweet)
        # skip lines that are not valid JSON
        except json.JSONDecodeError:
            continue
df_tweets = pd.DataFrame(tweets_data, columns=list(tweets_data[0].keys()))

df_tweets.head()


Unnamed: 0,created_at,id,id_str,full_text,truncated,display_text_range,entities,extended_entities,source,in_reply_to_status_id,...,place,contributors,is_quote_status,retweet_count,favorite_count,favorited,retweeted,possibly_sensitive,possibly_sensitive_appealable,lang
0,Tue Aug 01 16:23:56 +0000 2017,892420643555336193,892420643555336193,This is Phineas. He's a mystical boy. Only eve...,False,"[0, 85]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 892420639486877696, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,,,False,7433,35244,False,False,False,False,en
1,Tue Aug 01 00:17:27 +0000 2017,892177421306343426,892177421306343426,This is Tilly. She's just checking pup on you....,False,"[0, 138]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 892177413194625024, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,,,False,5527,30512,False,False,False,False,en
2,Mon Jul 31 00:18:03 +0000 2017,891815181378084864,891815181378084864,This is Archie. He is a rare Norwegian Pouncin...,False,"[0, 121]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 891815175371796480, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,,,False,3647,22943,False,False,False,False,en
3,Sun Jul 30 15:58:51 +0000 2017,891689557279858688,891689557279858688,This is Darla. She commenced a snooze mid meal...,False,"[0, 79]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 891689552724799489, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,,,False,7610,38530,False,False,False,False,en
4,Sat Jul 29 16:00:24 +0000 2017,891327558926688256,891327558926688256,This is Franklin. He would like you to stop ca...,False,"[0, 138]","{'hashtags': [{'text': 'BarkWeek', 'indices': ...","{'media': [{'id': 891327551943041024, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,,,False,8192,36805,False,False,False,False,en
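
Only a handful of these columns are needed later on. As a minimal sketch, the snippet below keeps the tweet ID, retweet count, and favorite count, using values copied from the first three rows of the output above. The names `demo_tweets` and `counts`, and the choice to rename `id` to `tweet_id` so it matches the other two tables, are assumptions for illustration.

```python
import pandas as pd

# Values copied from the first three rows of the output above
records = [
    {"id": 892420643555336193, "retweet_count": 7433, "favorite_count": 35244, "lang": "en"},
    {"id": 892177421306343426, "retweet_count": 5527, "favorite_count": 30512, "lang": "en"},
    {"id": 891815181378084864, "retweet_count": 3647, "favorite_count": 22943, "lang": "en"},
]
demo_tweets = pd.DataFrame(records)

# Keep only the columns needed and rename `id` to match the other tables
counts = (
    demo_tweets[["id", "retweet_count", "favorite_count"]]
    .rename(columns={"id": "tweet_id"})
)
print(counts)
```

A shared `tweet_id` column makes it straightforward to merge this table with the archive and the image predictions during cleaning.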


In [36]:
df_tweets.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2350 entries, 0 to 2349
Data columns (total 27 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   created_at                     2350 non-null   object 
 1   id                             2350 non-null   int64  
 2   id_str                         2350 non-null   object 
 3   full_text                      2350 non-null   object 
 4   truncated                      2350 non-null   bool   
 5   display_text_range             2350 non-null   object 
 6   entities                       2350 non-null   object 
 7   extended_entities              2078 non-null   object 
 8   source                         2350 non-null   object 
 9   in_reply_to_status_id          77 non-null     float64
 10  in_reply_to_status_id_str      77 non-null     object 
 11  in_reply_to_user_id            77 non-null     float64
 12  in_reply_to_user_id_str        77 non-null     o

## Assessing the Data

Now that each piece of data has been gathered, it is time to assess them visually and programmatically for quality and tidiness issues.
The goal is to detect and document at least eight (8) quality issues and two (2) tidiness issues.

Key points to keep in mind when data wrangling for this project:

* You only want original ratings (no retweets) that have images. Though there are 5000+ tweets in the dataset, not all are dog ratings and some are retweets.
* Assessing and cleaning the entire dataset completely would require a lot of time, and is not necessary to practice and demonstrate your skills in data wrangling. Therefore, the requirements of this project are only to assess and clean at least 8 quality issues and at least 2 tidiness issues in this dataset.
* Cleaning includes merging individual pieces of data according to the rules of tidy data.
* The fact that the rating numerators are greater than the denominators does not need to be cleaned. This unique rating system is a big part of the popularity of WeRateDogs.
* You do not need to gather the tweets beyond August 1st, 2017. You can, but note that you won't be able to gather the image predictions for these tweets since you don't have access to the algorithm used.
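
The first key point (original ratings only, and only those with images) can be expressed as a filter followed by a merge. The sketch below uses tiny made-up tables; the column names `tweet_id`, `retweeted_status_id`, and `jpg_url` come from the real data, while the values and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature versions of the archive and image-prediction tables
archive = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "retweeted_status_id": [np.nan, 99.0, np.nan],  # tweet 2 is a retweet
})
images = pd.DataFrame({
    "tweet_id": [1, 2],  # tweet 3 has no image prediction
    "jpg_url": ["https://example.com/a.jpg", "https://example.com/b.jpg"],
})

# Original tweets have no retweeted_status_id
originals = archive[archive["retweeted_status_id"].isna()]

# An inner merge keeps only tweets that also have an image prediction
originals_with_images = originals.merge(images, on="tweet_id", how="inner")
print(originals_with_images["tweet_id"].tolist())
```

Only tweet 1 survives both conditions: tweet 2 is dropped as a retweet and tweet 3 is dropped because it has no image.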

In [None]:
arch_df.head()

In [None]:
arch_df.info()
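
Beyond `.head()` and `.info()`, a few one-liners can surface typical quality issues programmatically. The following is a minimal sketch on a made-up sample; the column names are taken from the archive, but the rows are hypothetical and the checks are examples rather than a complete assessment.

```python
import pandas as pd

# Hypothetical sample using column names from the archive
sample = pd.DataFrame({
    "tweet_id": [1, 2, 2, 4],
    "rating_denominator": [10, 10, 10, 50],
    "name": ["Phineas", "a", "a", "None"],
})

# Duplicate tweet IDs
n_duplicates = sample["tweet_id"].duplicated().sum()

# Denominators other than the usual 10
odd_denominators = sample[sample["rating_denominator"] != 10]

# Names that are entirely lowercase are probably not real names
lowercase_names = sample.loc[sample["name"].str.islower(), "name"].unique()

print(n_duplicates, len(odd_denominators), list(lowercase_names))
```

Each check maps to an issue that can be written down during assessment and then fixed during cleaning.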

## Cleaning Data for this Project
Clean each of the issues you documented while assessing. Perform this cleaning in wrangle_act.ipynb as well. The result should be a high quality and tidy master pandas DataFrame (or DataFrames, if appropriate). Again, the issues that satisfy the Project Motivation must be cleaned.
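
As an illustration of what one cleaning step might look like, the sketch below fixes one quality issue (timestamps stored as strings) and one tidiness issue (a single variable, dog stage, spread over four columns). It runs on hypothetical rows and assumes that an absent stage is stored as the string `"None"`; the `dog_stage` column name and the `combine_stages` helper are also assumptions.

```python
import pandas as pd

# Hypothetical rows; assumes an absent stage is stored as the string "None"
raw = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "timestamp": [
        "2017-08-01 16:23:56 +0000",
        "2017-08-01 00:17:27 +0000",
        "2017-07-31 00:18:03 +0000",
    ],
    "doggo": ["None", "doggo", "None"],
    "floofer": ["None", "None", "None"],
    "pupper": ["None", "None", "pupper"],
    "puppo": ["None", "None", "None"],
})
stage_cols = ["doggo", "floofer", "pupper", "puppo"]

# Work on a copy so the raw data stays available for comparison
clean = raw.copy()

# Quality: timestamps should be datetimes, not strings
clean["timestamp"] = pd.to_datetime(clean["timestamp"])


# Tidiness: collapse the four stage columns into one variable
def combine_stages(row):
    stages = [row[col] for col in stage_cols if row[col] != "None"]
    return ",".join(stages) if stages else None


clean["dog_stage"] = clean[stage_cols].apply(combine_stages, axis=1)
clean = clean.drop(columns=stage_cols)

# Check the result: stage columns gone, timestamp dtype changed
print(clean.dtypes)
```

Following each change with a quick check like the final `print` makes it easy to confirm that the fix did what was intended.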

## Storing, Analyzing, and Visualizing Data for this Project
Store the clean DataFrame(s) in a CSV file with the main one named twitter_archive_master.csv. If additional files exist because multiple tables are required for tidiness, name these files appropriately. Additionally, you may store the cleaned data in a SQLite database (which is to be submitted as well if you do).
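
Storing the result could look roughly like the sketch below. The `master` DataFrame here is a hypothetical stand-in for the cleaned data, the files are written to a temporary folder for safety, and the SQLite file name and table name are assumptions.

```python
import os
import sqlite3
import tempfile

import pandas as pd

# Hypothetical stand-in for the cleaned master DataFrame
master = pd.DataFrame({
    "tweet_id": [892420643555336193, 892177421306343426],
    "rating_numerator": [13, 13],
})

out_dir = tempfile.mkdtemp()  # in practice, use the project folder
csv_path = os.path.join(out_dir, "twitter_archive_master.csv")
db_path = os.path.join(out_dir, "twitter_archive_master.db")

# index=False avoids writing the row index as an extra unnamed column
master.to_csv(csv_path, index=False)

# Optional: store the same table in a SQLite database
conn = sqlite3.connect(db_path)
try:
    master.to_sql("master", conn, index=False, if_exists="replace")
finally:
    conn.close()

# Read the CSV back to confirm that it round-trips
restored = pd.read_csv(csv_path)
print(restored.shape)
```

Reading the file back right after writing it is a cheap way to confirm that nothing was lost on the way to disk.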

Analyze and visualize your wrangled data in your wrangle_act.ipynb Jupyter Notebook. At least three (3) insights and one (1) visualization must be produced.
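
A first pass at insights might start from simple summary statistics. The sketch below uses only the five retweet and favorite counts shown in the `df_tweets.head()` output earlier, so the numbers are purely illustrative and say nothing reliable about the full dataset; the variable names are hypothetical.

```python
import pandas as pd

# Counts copied from the first five rows of df_tweets.head() above
tweets = pd.DataFrame({
    "retweet_count": [7433, 5527, 3647, 7610, 8192],
    "favorite_count": [35244, 30512, 22943, 38530, 36805],
})

# Insight 1: typical engagement
print(tweets.describe().loc[["mean", "50%", "max"]])

# Insight 2: how strongly retweets and favorites move together
correlation = tweets["retweet_count"].corr(tweets["favorite_count"])
print(f"Pearson correlation: {correlation:.2f}")

# Visualization: a scatter plot of the two counts (requires matplotlib)
# tweets.plot.scatter(x="retweet_count", y="favorite_count")
```

On the full wrangled data, the same calls would give the summary and correlation for every tweet, and the commented-out line would produce the required visualization.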

## Reporting for this Project
Create a 300-600 word written report called wrangle_report.pdf or wrangle_report.html that briefly describes your wrangling efforts. This is to be framed as an internal document.

Create a 250-word-minimum written report called act_report.pdf or act_report.html that communicates the insights and displays the visualization(s) produced from your wrangled data. This is to be framed as an external document, like a blog post or magazine article, for example.

Both of these documents can be created in separate Jupyter Notebooks using the Markdown functionality of Jupyter Notebooks, then downloading those notebooks as PDF or HTML files. You might prefer to use a word processor like Google Docs or Microsoft Word, however.

## References

https://www.earthdatascience.org/courses/use-data-open-source-python/intro-to-apis/twitter-data-in-python/

https://knowledge.udacity.com/questions/304779

https://stackoverflow.com/questions/12451431/loading-and-parsing-a-json-file-with-multiple-json-objects

https://github.com/tweepy/tweepy/blob/master/examples/oauth.py

http://docs.tweepy.org/en/latest/getting_started.html

https://stackoverflow.com/questions/21308762/avoid-twitter-api-limitation-with-tweepy

https://stackoverflow.com/questions/47612822/how-to-create-pandas-dataframe-from-twitter-search-api

https://www.pythonforbeginners.com/error-handling/python-try-and-except

https://stackabuse.com/reading-and-writing-json-to-a-file-in-python/

http://docs.tweepy.org/en/v3.5.0/getting_started.html#introduction

https://medium.com/ub-women-data-scholars/let-the-robot-do-your-work-web-scraping-with-python-9c147fb7690f