# Cleaning and Analyzing Data

## Gather

#### Import the required libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from IPython.display import display
pd.set_option('display.max_rows',None)
pd.set_option('display.max_columns',None)
pd.set_option('max_colwidth',150)

#### Load twitter-archive-enhanced into a DataFrame

In [2]:
df_raw = pd.read_csv('twitter-archive-enhanced.txt',sep=',',header=0)

#### Load image-predictions.tsv into a DataFrame

In [3]:
df_img = pd.read_csv('image-predictions.tsv',sep='\t',header=0)

#### Load tweet_json.txt into a DataFrame
+ Error encountered: 'ValueError: Trailing data'
+ Fix: add the parameter lines=True
+ Reference: https://stackoverflow.com/questions/30088006/loading-a-file-with-more-than-one-line-of-json-into-pandas


In [4]:
df_supl = pd.read_json('tweet_json.txt',lines=True)
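The effect of `lines=True` can be reproduced on a tiny in-memory sample. This is only a sketch: the two records below are made up, and the assumption is that tweet_json.txt stores one JSON object per line.

```python
import io

import pandas as pd

# Made-up sample: one JSON object per line, as in a line-delimited tweet dump.
sample = (
    '{"id": 1, "favorite_count": 10, "retweet_count": 2}\n'
    '{"id": 2, "favorite_count": 25, "retweet_count": 7}\n'
)

# Parsing the text as a single JSON document fails because a second
# object follows the first one.
try:
    pd.read_json(io.StringIO(sample))
    single_document_ok = True
except ValueError:
    single_document_ok = False

# lines=True tells pandas to parse each line as its own record.
df_demo = pd.read_json(io.StringIO(sample), lines=True)
print(single_document_ok, df_demo.shape)
```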

## Assess

### Assessing twitter-archive-enhanced.csv

In [5]:
df_raw.sample(5)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1322,706291001778950144,,,2016-03-06 01:31:11 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",When you're just relaxin and having a swell time but then remember you have to fill out the FAFSA ...11/10 https://t.co/qy33OBcexg,,,,"https://twitter.com/dog_rates/status/706291001778950144/photo/1,https://twitter.com/dog_rates/status/706291001778950144/photo/1",11,10,,,,,
1779,677716515794329600,,,2015-12-18 05:06:23 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",IT'S PUPPERGEDDON. Total of 144/120 ...I think https://t.co/ZanVtAtvIq,,,,https://twitter.com/dog_rates/status/677716515794329600/photo/1,144,120,,,,,
399,824796380199809024,,,2017-01-27 01:49:15 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",RT @dog_rates: This is Bailey. She loves going down slides but is very bad at it. Still 11/10 https://t.co/ivPWhspN3E,7.950767e+17,4196984000.0,2016-11-06 01:33:58 +0000,"https://twitter.com/dog_rates/status/795076730285391872/photo/1,https://twitter.com/dog_rates/status/795076730285391872/photo/1,https://twitter.co...",11,10,Bailey,,,,
2123,670385711116361728,,,2015-11-27 23:36:23 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Meet Larry. He's a Panoramic Benzoate. Can shoot lasers out of his eyes. Very neat. Stuck in that position tho. 8/10 https://t.co/MAZx8MPF0S,,,,https://twitter.com/dog_rates/status/670385711116361728/photo/1,8,10,Larry,,,,
54,881666595344535552,,,2017-07-03 00:11:11 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Gary. He couldn't miss this puppertunity for a selfie. Flawless focusing skills. 13/10 would boop intensely https://t.co/7CSWCl8I6s,,,,https://twitter.com/dog_rates/status/881666595344535552/photo/1,13,10,Gary,,,,


In [6]:
df_raw.shape

(2356, 17)

In [7]:
df_raw.columns

Index(['tweet_id', 'in_reply_to_status_id', 'in_reply_to_user_id', 'timestamp',
       'source', 'text', 'retweeted_status_id', 'retweeted_status_user_id',
       'retweeted_status_timestamp', 'expanded_urls', 'rating_numerator',
       'rating_denominator', 'name', 'doggo', 'floofer', 'pupper', 'puppo'],
      dtype='object')

#### 17 features and 2356 records in total.

+ tweet_id: the unique identifier of each tweet.
+ 'in_reply_to_status_id': Nullable. If the represented Tweet is a reply, this field will contain the integer representation of the original Tweet’s ID.
+ 'in_reply_to_user_id': Nullable. If the represented Tweet is a reply, this field will contain the integer representation of the original Tweet’s author ID. This will not necessarily always be the user directly mentioned in the Tweet.
+ 'timestamp'
+ 'source': Utility used to post the Tweet, as an HTML-formatted string.
+ 'text': The actual UTF-8 text of the status update, i.e. the text content of the tweet.
+ 'retweeted_status_id'
+ 'retweeted_status_user_id'
+ 'retweeted_status_timestamp'
+ 'expanded_urls'
+ 'rating_numerator': the numerator of the rating; usually greater than 10, but values below 10 also occur.
+ 'rating_denominator': the denominator of the rating; usually 10.
+ 'name': taken from the tweet text and presumed to be the dog's name.
+ 'doggo': a dog stage; there are four stages in total.
+ 'floofer': a dog stage; there are four stages in total.
+ 'pupper': a dog stage; there are four stages in total.
+ 'puppo': a dog stage; there are four stages in total.

Reference:
+ https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object

In [8]:
df_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), ob

In [9]:
df_raw.isnull().any()

tweet_id                      False
in_reply_to_status_id          True
in_reply_to_user_id            True
timestamp                     False
source                        False
text                          False
retweeted_status_id            True
retweeted_status_user_id       True
retweeted_status_timestamp     True
expanded_urls                  True
rating_numerator              False
rating_denominator            False
name                          False
doggo                         False
floofer                       False
pupper                        False
puppo                         False
dtype: bool

In [10]:
df_raw.duplicated(subset=['tweet_id','name']).sum()

0

In [11]:
df_raw['name'].value_counts().nlargest(20)

None       745
a           55
Charlie     12
Oliver      11
Cooper      11
Lucy        11
Tucker      10
Penny       10
Lola        10
Winston      9
Bo           9
Sadie        8
the          8
Toby         7
Bailey       7
Buddy        7
Daisy        7
an           7
Oscar        6
Jax          6
Name: name, dtype: int64

In [12]:
df_raw[['rating_numerator','rating_denominator']].describe()

Unnamed: 0,rating_numerator,rating_denominator
count,2356.0,2356.0
mean,13.126486,10.455433
std,45.876648,6.745237
min,0.0,0.0
25%,10.0,10.0
50%,11.0,10.0
75%,12.0,10.0
max,1776.0,170.0


In [13]:
df_raw.rating_numerator.value_counts().nlargest(20)

12     558
11     464
10     461
13     351
9      158
8      102
7       55
14      54
5       37
6       32
3       19
4       17
1        9
2        9
420      2
0        2
15       2
75       2
80       1
20       1
Name: rating_numerator, dtype: int64

In [14]:
df_raw.rating_denominator.value_counts()

10     2333
11        3
50        3
80        2
20        2
2         1
16        1
40        1
70        1
15        1
90        1
110       1
120       1
130       1
150       1
170       1
7         1
0         1
Name: rating_denominator, dtype: int64

In [15]:
for i in df_raw.text.values[:5]:
    display(i) 

"This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU"

"This is Tilly. She's just checking pup on you. Hopes you're doing ok. If not, she's available for pats, snugs, boops, the whole bit. 13/10 https://t.co/0Xxu71qeIV"

'This is Archie. He is a rare Norwegian Pouncing Corgo. Lives in the tall grass. You never know when one may strike. 12/10 https://t.co/wUnZnhtVJB'

'This is Darla. She commenced a snooze mid meal. 13/10 happens to the best of us https://t.co/tD36da7qLQ'

'This is Franklin. He would like you to stop calling him "cute." He is a very fierce shark and should be respected as such. 12/10 #BarkWeek https://t.co/AtUZn91f7f'

In [16]:
df_raw.query("name == 'a'").text.values[:5]

array(['Here is a pupper approaching maximum borkdrive. Zooming at never before seen speeds. 14/10 paw-inspiring af \n(IG: puffie_the_chow) https://t.co/ghXBIIeQZF',
       'Here is a perfect example of someone who has their priorities in order. 13/10 for both owner and Forrest https://t.co/LRyMrU7Wfq',
       'Guys this is getting so out of hand. We only rate dogs. This is a Galapagos Speed Panda. Pls only send dogs... 10/10 https://t.co/8lpAGaZRFn',
       'This is a mighty rare blue-tailed hammer sherk. Human almost lost a limb trying to take these. Be careful guys. 8/10 https://t.co/TGenMeXreW',
       'Viewer discretion is advised. This is a terrible attack in progress. Not even in water (tragic af). 4/10 bad sherk https://t.co/L3U0j14N5R'],
      dtype=object)

In [17]:
df_raw[df_raw.rating_numerator==45].text.values

array(['From left to right:\nCletus, Jerome, Alejandro, Burp, &amp; Titson\nNone know where camera is. 45/50 would hug all at once https://t.co/sedre1ivTK'],
      dtype=object)

In [18]:
df_raw[df_raw.rating_numerator==1776].text.values

array(["This is Atticus. He's quite simply America af. 1776/10 https://t.co/GRXwMxLBkh"],
      dtype=object)

In [19]:
df_raw[df_raw.rating_numerator==0].text.values[:]

array(["When you're so blinded by your systematic plagiarism that you forget what day it is. 0/10 https://t.co/YbEJPkg4Ag",
       "PUPDATE: can't see any. Even if I could, I couldn't reach them to pet. 0/10 much disappointment https://t.co/c7WXaB2nqX"],
      dtype=object)

In [20]:
df_raw[df_raw.rating_numerator==420].text.values[:1]

array(['@dhmontgomery We also gave snoop dogg a 420/10 but I think that predated your research'],
      dtype=object)

In [21]:
df_raw[df_raw.rating_numerator==20].text.values[:1]

array(["I'm aware that I could've said 20/16, but here at WeRateDogs we are very professional. An inconsistent rating scale is simply irresponsible"],
      dtype=object)

In [22]:
df_raw[df_raw.rating_denominator==170].text.values

array(['Say hello to this unbelievably well behaved squad of doggos. 204/170 would try to pet all at once https://t.co/yGQI3He3xv'],
      dtype=object)

In [23]:
df_raw[(df_raw.in_reply_to_status_id.notnull())&(df_raw.in_reply_to_user_id.notnull())].shape

(78, 17)

In [24]:
df_raw[(df_raw.rating_numerator>20)&(df_raw.rating_denominator>10)].shape

(13, 17)

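#### Summary of the df_raw assessment:
+ 'timestamp' and 'retweeted_status_timestamp' have dtype object.
+ The maxima of 'rating_numerator' and 'rating_denominator' are 1776 and 170, far above their means of roughly 13 and 10.
+ 'rating_numerator' contains large values such as 45, whose 'rating_denominator' is 50; this is the combined numerator and denominator for five dogs.
+ 'in_reply_to_status_id', 'in_reply_to_user_id', 'retweeted_status_id', 'retweeted_status_user_id', 'retweeted_status_timestamp', and 'expanded_urls' contain nulls.
+ The dog stage is spread across four columns.
+ Some dogs have no stage; the value is the string 'None', not null.
+ Some dogs have no name; the value is the string 'None', not null.
+ Some dog names are 'an', 'the', or 'a'.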
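The difference between a real null and the placeholder string 'None' can be seen on a small made-up Series. This is a sketch only; the values below are illustrative and the lowercase test is just one heuristic for spotting entries such as 'a' or 'the'.

```python
import pandas as pd

# Made-up values that mimic the 'name' column of the archive.
names = pd.Series(['Phineas', 'None', 'a', 'the', 'Tilly', None], name='name')

# isnull() only sees the genuinely missing entry.
null_count = names.isnull().sum()
# The placeholder is an ordinary string, so it has to be matched explicitly.
placeholder_count = (names == 'None').sum()
# Entirely lowercase entries are unlikely to be real names.
lowercase_count = names.dropna().str.islower().sum()

print(null_count, placeholder_count, lowercase_count)
```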

### Assessing image-predictions.tsv

In [25]:
df_img.sample(5)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
533,676897532954456065,https://pbs.twimg.com/media/CWTSt0UW4AALMNB.jpg,1,hamster,0.628255,False,guinea_pig,0.318646,False,macaque,0.013058,False
1649,809084759137812480,https://pbs.twimg.com/media/CzpyM41UoAE1b2w.jpg,1,vizsla,0.911412,True,bloodhound,0.017134,True,Labrador_retriever,0.011761,True
785,690248561355657216,https://pbs.twimg.com/media/CZRBZ9mWkAAWblt.jpg,1,motor_scooter,0.38269,False,moped,0.318017,False,pickup,0.040625,False
117,668142349051129856,https://pbs.twimg.com/media/CUW37BzWsAAlJlN.jpg,1,Angora,0.918834,False,hen,0.037793,False,wood_rabbit,0.011015,False
1898,850753642995093505,https://pbs.twimg.com/media/C8576jrW0AEYWFy.jpg,1,pug,0.996952,True,bull_mastiff,0.000996,True,French_bulldog,0.000883,True


In [26]:
df_img.shape

(2075, 12)

In [27]:
df_img.columns

Index(['tweet_id', 'jpg_url', 'img_num', 'p1', 'p1_conf', 'p1_dog', 'p2',
       'p2_conf', 'p2_dog', 'p3', 'p3_conf', 'p3_dog'],
      dtype='object')

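#### 12 features and 2075 records in total.

+ tweet_id: the unique identifier of each tweet.
+ jpg_url: link to the image the prediction was made on.
+ img_num: the number of the image that the most confident prediction corresponds to.
+ p1: the algorithm's #1 prediction for the image in the tweet.
+ p1_conf: how confident the algorithm is in its #1 prediction.
+ p1_dog: whether the #1 prediction is a dog (it may be another species, such as a bear or a horse); True means the prediction is a dog.
+ p2: the algorithm's second most likely prediction for the image.
+ p2_conf: how confident the algorithm is in its #2 prediction.
+ p2_dog: whether the #2 prediction is a dog.
+ p3: the third most likely prediction for the image.
+ p3_conf: how confident the algorithm is in its #3 prediction.
+ p3_dog: whether the #3 prediction is a dog.

Reference:
+ Udacity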

In [28]:
df_img.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
tweet_id    2075 non-null int64
jpg_url     2075 non-null object
img_num     2075 non-null int64
p1          2075 non-null object
p1_conf     2075 non-null float64
p1_dog      2075 non-null bool
p2          2075 non-null object
p2_conf     2075 non-null float64
p2_dog      2075 non-null bool
p3          2075 non-null object
p3_conf     2075 non-null float64
p3_dog      2075 non-null bool
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


In [29]:
df_img.describe()

Unnamed: 0,tweet_id,img_num,p1_conf,p2_conf,p3_conf
count,2075.0,2075.0,2075.0,2075.0,2075.0
mean,7.384514e+17,1.203855,0.594548,0.1345886,0.06032417
std,6.785203e+16,0.561875,0.271174,0.1006657,0.05090593
min,6.660209e+17,1.0,0.044333,1.0113e-08,1.74017e-10
25%,6.764835e+17,1.0,0.364412,0.05388625,0.0162224
50%,7.119988e+17,1.0,0.58823,0.118181,0.0494438
75%,7.932034e+17,1.0,0.843855,0.1955655,0.09180755
max,8.924206e+17,4.0,1.0,0.488014,0.273419


In [30]:
df_img.isna().any()

tweet_id    False
jpg_url     False
img_num     False
p1          False
p1_conf     False
p1_dog      False
p2          False
p2_conf     False
p2_dog      False
p3          False
p3_conf     False
p3_dog      False
dtype: bool

In [31]:
df_img.duplicated(subset=['tweet_id'],keep="first").sum()

0

In [32]:
df_img[(df_img.p1_conf>=df_img.p2_conf)&(df_img.p2_conf>=df_img.p3_conf)|(df_img.p1_conf>=df_img.p3_conf)].shape

(2075, 12)
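When combining boolean masks, note that `&` binds more tightly than `|`, so an expression of the form `a & b | c` is evaluated as `(a & b) | c`. The sketch below uses made-up confidences (the second row deliberately violates the ordering) to show how that differs from a strict check that `p1_conf >= p2_conf >= p3_conf`.

```python
import pandas as pd

# Made-up confidences; the second row violates p2_conf >= p3_conf.
demo = pd.DataFrame({
    'p1_conf': [0.6, 0.5],
    'p2_conf': [0.3, 0.1],
    'p3_conf': [0.1, 0.2],
})

a = demo.p1_conf >= demo.p2_conf
b = demo.p2_conf >= demo.p3_conf
c = demo.p1_conf >= demo.p3_conf

loose = a & b | c   # parsed as (a & b) | c
strict = a & b      # p1_conf >= p2_conf >= p3_conf

print(loose.tolist(), strict.tolist())
```

Only the strict mask flags the second row, so it is the safer form when the goal is to confirm that the three predictions are sorted by confidence.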

In [33]:
df_img.img_num.value_counts()

1    1780
2     198
3      66
4      31
Name: img_num, dtype: int64

In [34]:
len(df_img.query("img_num >1"))

295

In [35]:
df_img.p1.str.istitle().sum()

402

In [36]:
df_img.p1.str.islower().sum()

1135

In [37]:
df_img[df_img.p1.str.istitle()].head(1)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
15,666099513787052032,https://pbs.twimg.com/media/CT51-JJUEAA6hV8.jpg,1,Lhasa,0.58233,True,Shih-Tzu,0.166192,True,Dandie_Dinmont,0.089688,True


In [38]:
df_img[~((df_img.p1.str.istitle())|(df_img.p1.str.islower()))].head(1)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True


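#### Summary of the df_img assessment:
+ The columns p2, p2_conf, p3, p3_conf, p2_dog, and p3_dog are not needed.
+ img_num is the number of the corresponding image; this column is not needed.
+ p1, the dog breed, is inconsistently capitalized.



### Assessing tweet_json.txt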

In [39]:
df_supl.sample(5)

Unnamed: 0,contributors,coordinates,created_at,display_text_range,entities,extended_entities,favorite_count,favorited,full_text,geo,id,id_str,in_reply_to_screen_name,in_reply_to_status_id,in_reply_to_status_id_str,in_reply_to_user_id,in_reply_to_user_id_str,is_quote_status,lang,place,possibly_sensitive,possibly_sensitive_appealable,quoted_status,quoted_status_id,quoted_status_id_str,retweet_count,retweeted,retweeted_status,source,truncated,user
845,,,2016-08-17 20:27:34,"[0, 113]","{'hashtags': [], 'symbols': [], 'user_mentions': [{'screen_name': 'ShopWeRateDogs', 'name': 'The Dog Rates Shop', 'id': 752178371354882049, 'id_st...","{'media': [{'id': 766008557347233792, 'id_str': '766008557347233792', 'indices': [114, 137], 'media_url': 'http://pbs.twimg.com/media/CqFouXOXYAAY...",4120,False,This is Angel. She stole the @ShopWeRateDogs shirt from her owner. Fits pretty well actually. 11/10 would forgive https://t.co/jaivZ1dcUL,,766008592277377025,766008592277377024,,,,,,False,en,,0.0,0.0,,,,568,False,,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",False,"{'id': 4196983835, 'id_str': '4196983835', 'name': 'SpookyWeRateDogs™', 'screen_name': 'dog_rates', 'location': 'MERCH↴ DM DOGS. WE WILL RATE', '..."
2007,,,2015-12-03 02:45:32,"[0, 124]","{'hashtags': [], 'symbols': [], 'user_mentions': [], 'urls': [], 'media': [{'id': 672245248555409408, 'id_str': '672245248555409408', 'indices': [...","{'media': [{'id': 672245248555409408, 'id_str': '672245248555409408', 'indices': [101, 124], 'media_url': 'http://pbs.twimg.com/media/CVRLfeoW4AA_...",727,False,Meet Snickers. He's adorable. Also comes in t-shirt mode. 12/10 I would aggressively caress Snickers https://t.co/aCRKDaFmVr,,672245253877968896,672245253877968896,,,,,,False,en,,0.0,0.0,,,,169,False,,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",False,"{'id': 4196983835, 'id_str': '4196983835', 'name': 'SpookyWeRateDogs™', 'screen_name': 'dog_rates', 'location': 'MERCH↴ DM DOGS. WE WILL RATE', '..."
412,,,2017-01-21 03:29:14,"[0, 133]","{'hashtags': [], 'symbols': [], 'user_mentions': [{'screen_name': 'dog_rates', 'name': 'SpookyWeRateDogs™', 'id': 4196983835, 'id_str': '419698383...","{'media': [{'id': 822489041455157248, 'id_str': '822489041455157248', 'indices': [110, 133], 'media_url': 'http://pbs.twimg.com/media/C2oRbOuWEAAb...",0,False,RT @dog_rates: This is Paisley. She really wanted to be president this time. Dreams officially crushed. 13/10 https://t.co/liJGwMp17E,,822647212903690241,822647212903690240,,,,,,False,en,,0.0,0.0,,,,7349,False,"{'created_at': 'Fri Jan 20 17:00:46 +0000 2017', 'id': 822489057087389700, 'id_str': '822489057087389700', 'full_text': 'This is Paisley. She real...","<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",False,"{'id': 4196983835, 'id_str': '4196983835', 'name': 'SpookyWeRateDogs™', 'screen_name': 'dog_rates', 'location': 'MERCH↴ DM DOGS. WE WILL RATE', '..."
834,,,2016-08-21 23:15:55,"[0, 87]","{'hashtags': [], 'symbols': [], 'user_mentions': [], 'urls': [], 'media': [{'id': 767500493799231488, 'id_str': '767500493799231488', 'indices': [...","{'media': [{'id': 767500493799231488, 'id_str': '767500493799231488', 'indices': [88, 111], 'media_url': 'http://pbs.twimg.com/media/Cqa1ofnXEAAG0...",8253,False,This is Louie. He's making quite a h*ckin mess. Doesn't seem to care. 12/10 jubilant af https://t.co/Z2g2YWPzX2,,767500508068192258,767500508068192256,,,,,,False,en,,0.0,0.0,,,,2671,False,,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",False,"{'id': 4196983835, 'id_str': '4196983835', 'name': 'SpookyWeRateDogs™', 'screen_name': 'dog_rates', 'location': 'MERCH↴ DM DOGS. WE WILL RATE', '..."
1661,,,2016-01-01 02:29:49,"[0, 139]","{'hashtags': [], 'symbols': [], 'user_mentions': [], 'urls': [], 'media': [{'id': 682750531614445569, 'id_str': '682750531614445569', 'indices': [...","{'media': [{'id': 682750531614445569, 'id_str': '682750531614445569', 'indices': [116, 139], 'media_url': 'http://pbs.twimg.com/media/CXmd_bsWkAEE...",1665,False,Meet Taco. He's a speckled Garnier Fructis. Loves to shadow box. Ears out of control. Friend clearly impressed 9/10 https://t.co/85X1GHohFr,,682750546109968385,682750546109968384,,,,,,False,en,,0.0,0.0,,,,491,False,,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",False,"{'id': 4196983835, 'id_str': '4196983835', 'name': 'SpookyWeRateDogs™', 'screen_name': 'dog_rates', 'location': 'MERCH↴ DM DOGS. WE WILL RATE', '..."


In [40]:
df_supl.shape

(2352, 31)

In [41]:
df_supl.columns

Index(['contributors', 'coordinates', 'created_at', 'display_text_range',
       'entities', 'extended_entities', 'favorite_count', 'favorited',
       'full_text', 'geo', 'id', 'id_str', 'in_reply_to_screen_name',
       'in_reply_to_status_id', 'in_reply_to_status_id_str',
       'in_reply_to_user_id', 'in_reply_to_user_id_str', 'is_quote_status',
       'lang', 'place', 'possibly_sensitive', 'possibly_sensitive_appealable',
       'quoted_status', 'quoted_status_id', 'quoted_status_id_str',
       'retweet_count', 'retweeted', 'retweeted_status', 'source', 'truncated',
       'user'],
      dtype='object')

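#### 31 features and 2352 records in total.

+ favorite_count: the number of times the tweet was liked; reflects whether readers like the tweet.
+ retweet_count: the number of times the tweet was retweeted; reflects how much readers like the tweet.
+ id: the unique identifier of the tweet.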


In [42]:
df_supl.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2352 entries, 0 to 2351
Data columns (total 31 columns):
contributors                     0 non-null float64
coordinates                      0 non-null float64
created_at                       2352 non-null datetime64[ns]
display_text_range               2352 non-null object
entities                         2352 non-null object
extended_entities                2073 non-null object
favorite_count                   2352 non-null int64
favorited                        2352 non-null bool
full_text                        2352 non-null object
geo                              0 non-null float64
id                               2352 non-null int64
id_str                           2352 non-null int64
in_reply_to_screen_name          78 non-null object
in_reply_to_status_id            78 non-null float64
in_reply_to_status_id_str        78 non-null float64
in_reply_to_user_id              78 non-null float64
in_reply_to_user_id_str          78 n

In [43]:
df_supl[['favorite_count','retweet_count']].describe()

Unnamed: 0,favorite_count,retweet_count
count,2352.0,2352.0
mean,8109.19898,3134.932398
std,11980.795669,5237.846296
min,0.0,0.0
25%,1417.0,618.0
50%,3596.5,1456.5
75%,10118.0,3628.75
max,132318.0,79116.0


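#### Summary of the df_supl assessment:
+ Keep the three columns favorite_count, retweet_count, and id; the other columns are not needed.
+ Like df_raw, this dataset stores the dog as its observation.
+ id has dtype int64.
+ Consider extracting the id from the entities column and validating it against the existing id column.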



### Assessment summary:

#### Quality

#####  **twitter-archive-enhanced data**
+ The columns 'in_reply_to_status_id', 'in_reply_to_user_id', 'retweeted_status_id', 'retweeted_status_user_id', 'retweeted_status_timestamp', 'expanded_urls', and 'source' are not needed.
+ Some dogs have no name; the value is 'None', 'an', 'the', or 'a' rather than null.
+ Some dogs have no stage; the value is the string 'None', not null.
+ The maxima of 'rating_numerator' and 'rating_denominator' are 1776 and 170, far above their means of roughly 13 and 10.
+ timestamp has dtype object, and tweet_id has dtype int64.

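##### **image-predictions data**
+ p1, the dog breed, is inconsistently capitalized.
+ The columns p2, p2_conf, p3, p3_conf, p2_dog, p3_dog, and img_num are not needed.

##### **tweet_json data**
+ Keep the three feature columns favorite_count, retweet_count, and id; the other columns are not needed.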


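#### Tidiness
+ df_raw, df_img, and df_supl all have the dog as their observation.
+ In df_raw the dog stage is spread across four columns.
+ df_img uses two columns, 'p1' and 'p1_dog', to describe the dog breed.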
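One possible way to address the stage columns is to collapse them into a single column. The sketch below is an assumption about how that could be done, not a step the notebook performs: the column name `stage`, the helper `combine_stages`, and the sample rows are all hypothetical.

```python
import pandas as pd

stage_cols = ['doggo', 'floofer', 'pupper', 'puppo']

# Made-up rows: the string 'None' marks "no stage", as in the archive.
demo = pd.DataFrame({
    'tweet_id': ['1', '2', '3'],
    'doggo':   ['None', 'doggo', 'None'],
    'floofer': ['None', 'None', 'None'],
    'pupper':  ['None', 'pupper', 'pupper'],
    'puppo':   ['None', 'None', 'None'],
})

def combine_stages(row):
    """Join every stage present in the row; return None when there is none."""
    found = [row[col] for col in stage_cols if row[col] != 'None']
    return ','.join(found) if found else None

# Hypothetical 'stage' column replacing the four original columns.
demo['stage'] = demo[stage_cols].apply(combine_stages, axis=1)
demo = demo.drop(columns=stage_cols)
print(demo)
```

Joining with a comma keeps rows that mention more than one stage visible instead of silently picking one of them.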


## Clean

### Cleaning data quality issues

#### Back up the three datasets

In [44]:
df_raw_clean = df_raw.copy()
df_img_clean = df_img.copy()
df_supl_clean = df_supl.copy()

#### Remove unneeded features

##### Define
+ Drop the following unneeded columns from df_raw: 'in_reply_to_status_id', 'in_reply_to_user_id', 'retweeted_status_id', 'retweeted_status_user_id', 'retweeted_status_timestamp', 'expanded_urls', 'source'

+ Drop the columns p2, p2_conf, p3, p3_conf, p2_dog, p3_dog, and img_num from df_img.

+ Drop every column from df_supl except the three columns 'favorite_count', 'retweet_count', and 'id'.

##### Code

In [45]:
df_raw_clean.drop(columns=['in_reply_to_status_id','in_reply_to_user_id','retweeted_status_id','retweeted_status_user_id',
                            'retweeted_status_timestamp','expanded_urls','source'],inplace=True)
df_img_clean.drop(columns=['p2','p2_conf','p3','p3_conf','p2_dog','p3_dog','img_num'],inplace=True)
# .copy() makes the selection an independent DataFrame, so later column assignments do not raise a SettingWithCopyWarning
df_supl_clean = df_supl_clean[['id','favorite_count','retweet_count']].copy()

##### Test

In [46]:
df_raw_clean.columns

Index(['tweet_id', 'timestamp', 'text', 'rating_numerator',
       'rating_denominator', 'name', 'doggo', 'floofer', 'pupper', 'puppo'],
      dtype='object')

In [47]:
df_img_clean.columns

Index(['tweet_id', 'jpg_url', 'p1', 'p1_conf', 'p1_dog'], dtype='object')

In [48]:
df_supl_clean.columns

Index(['id', 'favorite_count', 'retweet_count'], dtype='object')

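#### Clean missing values

##### Define
+ Use replace to turn 'None', 'a', 'an', and 'the' in the 'name' column of df_raw into null.
+ Use dropna to remove the nulls together with their records.

##### Code

In [49]:
# Assign the result back instead of calling replace(inplace=True) through a column accessor
df_raw_clean['name'] = df_raw_clean['name'].replace(['None','a','an','the'],np.nan)
df_raw_clean.dropna(axis='index',inplace=True)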

##### Test

In [50]:
df_raw_clean.name.isnull().sum()

0
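The replace-then-drop pattern can be tried on a few made-up rows. This is a sketch under the assumption that only the `name` column should decide which rows are removed, which is why `dropna` is given `subset=['name']` here.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for the archive.
demo = pd.DataFrame({
    'tweet_id': [1, 2, 3, 4],
    'name': ['Phineas', 'None', 'a', 'Tilly'],
})

# Turn the placeholder strings into real missing values.
demo['name'] = demo['name'].replace(['None', 'a', 'an', 'the'], np.nan)
# subset=['name'] limits the drop to rows whose name is missing.
demo = demo.dropna(subset=['name'])
print(demo)
```

Restricting the drop to one column avoids losing rows because of nulls in unrelated columns.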

#### Clean data types

##### Define
+ Use astype or to_datetime to convert 'timestamp' in df_raw to a datetime type.
+ Use astype to convert 'tweet_id' in df_raw and df_img, and 'id' in df_supl, to str.

##### Code

In [51]:
df_raw_clean.timestamp = pd.to_datetime(df_raw_clean.timestamp)
df_raw_clean.tweet_id = df_raw_clean.tweet_id.astype(str)
df_img_clean.tweet_id = df_img_clean.tweet_id.astype(str)
df_supl_clean.id = df_supl_clean.id.astype(str)

##### Test

In [52]:
df_raw_clean.dtypes

tweet_id                      object
timestamp             datetime64[ns]
text                          object
rating_numerator               int64
rating_denominator             int64
name                          object
doggo                         object
floofer                       object
pupper                        object
puppo                         object
dtype: object

In [53]:
df_img_clean.dtypes

tweet_id     object
jpg_url      object
p1           object
p1_conf     float64
p1_dog         bool
dtype: object

In [54]:
df_supl_clean.dtypes

id                object
favorite_count     int64
retweet_count      int64
dtype: object
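One reason to store tweet IDs as strings is that they are identifiers rather than quantities, and an 18-digit integer does not survive a round trip through float64. The sketch below illustrates this with the ID of the first tweet shown earlier; the one-row DataFrame is made up for the demonstration.

```python
import pandas as pd

tweet_id = 892420643555336193  # ID of the first tweet in the archive

# float64 has 53 bits of precision, so this 18-digit integer is rounded.
survives_float = int(float(tweet_id)) == tweet_id

demo = pd.DataFrame({
    'tweet_id': [tweet_id],
    'timestamp': ['2017-08-01 16:23:56 +0000'],
})
demo['tweet_id'] = demo['tweet_id'].astype(str)
demo['timestamp'] = pd.to_datetime(demo['timestamp'])

print(survives_float, demo.dtypes.to_dict())
```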

#### Clean consistency issues

##### Define
+ Use str.title() to give the dog breeds in 'p1' of df_img a uniform format.

##### Code

In [55]:
df_img_clean.p1=df_img_clean.p1.str.title()

##### Test

In [56]:
# .all() is True only if every value is title-cased
df_img_clean.p1.str.istitle().all()

True
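A test for title case should aggregate the per-row results, for example with `.all()`. Comparing the length of the boolean Series with the number of rows holds for any data, normalized or not. A small sketch with made-up breed labels:

```python
import pandas as pd

# Made-up labels with mixed capitalization.
p1 = pd.Series(['golden_retriever', 'Lhasa', 'Welsh_springer_spaniel'])
p1_title = p1.str.title()

# Holds for any Series, so it cannot detect a problem.
length_check = p1.str.istitle().shape[0] == p1.shape[0]
# True only when every value is title-cased.
before = p1.str.istitle().all()
after = p1_title.str.istitle().all()

print(length_check, before, after)
print(p1_title.tolist())
```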

#### Clean outliers

##### Define

+ Filter out the df_raw records whose 'rating_numerator' is greater than 20 or whose 'rating_denominator' is greater than 10.

##### Code

In [57]:
df_raw_clean = df_raw_clean[~((df_raw_clean.rating_numerator>20)|(df_raw_clean.rating_denominator>10))]

##### Test

In [58]:
df_raw_clean.describe()

Unnamed: 0,rating_numerator,rating_denominator
count,1533.0,1533.0
mean,10.804958,10.0
std,1.888311,0.0
min,2.0,10.0
25%,10.0,10.0
50%,11.0,10.0
75%,12.0,10.0
max,14.0,10.0


In [59]:
df_raw_clean[df_raw_clean.rating_denominator==0]

Unnamed: 0,tweet_id,timestamp,text,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
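The outlier filter can be checked on a handful of made-up ratings. The sketch below also shows the equivalent form obtained from De Morgan's law, which some readers find easier to follow; the sample values are illustrative only.

```python
import pandas as pd

# Made-up ratings, including some of the unusual values seen earlier.
demo = pd.DataFrame({
    'rating_numerator':   [13, 45, 1776, 20, 9],
    'rating_denominator': [10, 50, 10, 16, 10],
})

too_high = demo.rating_numerator > 20
odd_scale = demo.rating_denominator > 10

# Drop a row if either condition holds.
kept = demo[~(too_high | odd_scale)]
# Equivalent: keep a row only if both "normal" conditions hold.
kept_alt = demo[(demo.rating_numerator <= 20) & (demo.rating_denominator <= 10)]

print(kept)
```

Note that neither form removes denominators below 10; a condition such as `rating_denominator != 10` would be needed if those should be excluded as well.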


### Cleaning tidiness issues

#### Clean the issue of several columns describing one feature

##### Define
+ Filter out the records in df_img whose 'p1_dog' is False.

##### Code

In [74]:
df_img_clean = df_img_clean[df_img_clean.p1_dog]

In [60]:
df_raw_clean.head()

Unnamed: 0,tweet_id,timestamp,text,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,2017-08-01 16:23:56,This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU,13,10,Phineas,,,,
1,892177421306343426,2017-08-01 00:17:27,"This is Tilly. She's just checking pup on you. Hopes you're doing ok. If not, she's available for pats, snugs, boops, the whole bit. 13/10 https:/...",13,10,Tilly,,,,
2,891815181378084864,2017-07-31 00:18:03,This is Archie. He is a rare Norwegian Pouncing Corgo. Lives in the tall grass. You never know when one may strike. 12/10 https://t.co/wUnZnhtVJB,12,10,Archie,,,,
3,891689557279858688,2017-07-30 15:58:51,This is Darla. She commenced a snooze mid meal. 13/10 happens to the best of us https://t.co/tD36da7qLQ,13,10,Darla,,,,
4,891327558926688256,2017-07-29 16:00:24,"This is Franklin. He would like you to stop calling him ""cute."" He is a very fierce shark and should be respected as such. 12/10 #BarkWeek https:/...",12,10,Franklin,,,,


##### Test

In [76]:
df_img_clean.p1_dog.value_counts()

True    1532
Name: p1_dog, dtype: int64

#### Clean the issue of several tables describing the same observation

##### Define


##### Code
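A minimal sketch of one way to combine the three tables, assuming they should be joined on the tweet identifier and that only tweets present in all three tables are wanted. The sample frames and the name `master` are hypothetical and stand in for the cleaned DataFrames.

```python
import pandas as pd

# Made-up stand-ins for the three cleaned tables.
archive = pd.DataFrame({'tweet_id': ['1', '2', '3'],
                        'name': ['Phineas', 'Tilly', 'Archie']})
images = pd.DataFrame({'tweet_id': ['1', '3'],
                       'p1': ['Chihuahua', 'Pug']})
counts = pd.DataFrame({'id': ['1', '2', '3'],
                       'favorite_count': [100, 50, 75],
                       'retweet_count': [10, 5, 8]})

# Align the key name, then inner-join so only tweets found in every table remain.
counts = counts.rename(columns={'id': 'tweet_id'})
master = (archive
          .merge(images, on='tweet_id', how='inner')
          .merge(counts, on='tweet_id', how='inner'))
print(master)
```

The keys must share a dtype before merging, which is one reason for converting all three ID columns to str beforehand.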

##### Test

### Store the cleaned master dataset

In [63]:
# Save the cleaned master dataset as a csv file or a SQLite database file
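Both storage options can be sketched as follows. The file names, the table name `master`, and the two-row DataFrame are hypothetical, and a temporary directory is used so the example does not touch the project files.

```python
import os
import sqlite3
import tempfile

import pandas as pd

# Made-up stand-in for the cleaned master dataset.
master = pd.DataFrame({'tweet_id': ['1', '3'], 'favorite_count': [100, 75]})

out_dir = tempfile.mkdtemp()
csv_path = os.path.join(out_dir, 'twitter_archive_master.csv')  # hypothetical name
db_path = os.path.join(out_dir, 'twitter_archive_master.db')    # hypothetical name

# Option 1: CSV without the index column.
master.to_csv(csv_path, index=False)

# Option 2: a table in a SQLite database file.
conn = sqlite3.connect(db_path)
master.to_sql('master', conn, index=False, if_exists='replace')
conn.close()

# Reading back with dtype=str keeps tweet_id as text instead of an integer.
restored = pd.read_csv(csv_path, dtype={'tweet_id': str})
print(restored)
```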


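## Analysis and Visualization

> Hints:
- Analyze the master dataset in the Notebook with pandas or SQL and produce at least three (3) independent conclusions.
- Create at least one (1) visualization in the Notebook with a Python plotting library, or in Tableau.


In [64]:
# Analysis or visualization code


Insights or notes on the analysis or visualization above

In [65]:
# Analysis or visualization code


Insights or notes on the analysis or visualization above

In [66]:
# You need to add more code cells and markdown cells to complete all of the analysis and visualization


Further notes or a summary

> Hint: after finishing everything in the Notebook, you also need to complete two PDF reports made up of text and images. Because these reports contain only text and images and no code, you can write them in a word processor such as Word:
- Create a written report of 300-600 words named `wrangle_report.pdf` that briefly describes your data wrangling process. This report can be treated as an internal document for your team members to read and discuss.
- Create a written report of at least 250 words named `act_report.pdf` in which you share your insights with readers and show the visualizations produced from the wrangled data. This report can be treated as an external document, such as a blog post or magazine article.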