# Dog Analysis Report for the Twitter Account @WeRateDogs

## Gather

#### Import the required libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from IPython.display import display
pd.set_option('display.max_rows',None)
pd.set_option('display.max_columns',None)
pd.set_option('max_colwidth',150)

#### Load twitter-archive-enhanced into a DataFrame

In [2]:
df_raw = pd.read_csv('twitter-archive-enhanced.txt',sep=',',header=0)

#### Load image-predictions.tsv into a DataFrame

In [3]:
df_img = pd.read_csv('image-predictions.tsv',sep='\t',header=0)

#### Load tweet_json.txt into a DataFrame
+ Error encountered: 'ValueError: Trailing data'
+ Fix: pass the parameter lines=True
+ Reference: https://stackoverflow.com/questions/30088006/loading-a-file-with-more-than-one-line-of-json-into-pandas


In [4]:
df_supl = pd.read_json('tweet_json.txt',lines=True)
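The 'Trailing data' error appears when a file holds one JSON object per line but is parsed as a single JSON document. The sketch below reproduces the fix on a tiny in-memory sample; the two records and the name `df_demo` are made up for illustration and are not taken from tweet_json.txt.

```python
import io
import pandas as pd

# Two JSON objects, one per line (a hypothetical miniature of tweet_json.txt)
raw = io.StringIO(
    '{"id": 1, "favorite_count": 10, "retweet_count": 2}\n'
    '{"id": 2, "favorite_count": 30, "retweet_count": 5}\n'
)

# lines=True tells pandas to parse each line as a separate JSON record
df_demo = pd.read_json(raw, lines=True)
print(df_demo.shape)
```

Each line becomes one row, and the keys of the JSON objects become the columns.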

## Assess

### Assessing the twitter-archive-enhanced.csv file

In [5]:
df_raw.sample(5)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
783,775350846108426240,,,2016-09-12 15:10:21 +0000,"<a href=""http://vine.co"" rel=""nofollow"">Vine - Make a Scene</a>",This is Maximus. A little rain won't stop him. He will persevere. 12/10 innovative af https://t.co/2OmDMAkkou,,,,https://vine.co/v/ijmv0PD0XXD,12,10,Maximus,,,,
1903,674638615994089473,,,2015-12-09 17:15:54 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This pupper is fed up with being tickled. 12/10 I'm currently working on an elaborate heist to steal this dog https://t.co/F33n1hy3LL,,,,https://twitter.com/dog_rates/status/674638615994089473/photo/1,12,10,,,,pupper,
520,810254108431155201,,,2016-12-17 22:43:27 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","This is Gus. He likes to be close to you, which is good because you want to be close to Gus. 12/10 would boop then pet https://t.co/DrsrQkEfnb",,,,https://twitter.com/dog_rates/status/810254108431155201/photo/1,12,10,Gus,,,,
96,873580283840344065,,,2017-06-10 16:39:04 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","We usually don't rate Deck-bound Saskatoon Black Bears, but this one is h*ckin flawless. Sneaky tongue slip too. 13/10 would hug firmly https://t....",,,,https://twitter.com/dog_rates/status/873580283840344065/photo/1,13,10,,,,,
453,818588835076603904,,,2017-01-09 22:42:41 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",RT @dog_rates: This is Chelsea. She forgot how to dog. 11/10 get it together pupper https://t.co/nBJ5RE4yHb,7.735476e+17,4196984000.0,2016-09-07 15:44:53 +0000,"https://twitter.com/dog_rates/status/773547596996571136/photo/1,https://twitter.com/dog_rates/status/773547596996571136/photo/1",11,10,Chelsea,,,pupper,


In [6]:
df_raw.shape

(2356, 17)

In [7]:
df_raw.columns

Index(['tweet_id', 'in_reply_to_status_id', 'in_reply_to_user_id', 'timestamp',
       'source', 'text', 'retweeted_status_id', 'retweeted_status_user_id',
       'retweeted_status_timestamp', 'expanded_urls', 'rating_numerator',
       'rating_denominator', 'name', 'doggo', 'floofer', 'pupper', 'puppo'],
      dtype='object')

#### 17 features and 2356 records in total.

+ 'tweet_id': the unique identifier of each tweet.
+ 'in_reply_to_status_id': Nullable. If the represented Tweet is a reply, this field will contain the integer representation of the original Tweet’s ID.
+ 'in_reply_to_user_id': Nullable. If the represented Tweet is a reply, this field will contain the integer representation of the original Tweet’s author ID. This will not necessarily always be the user directly mentioned in the Tweet.
+ 'timestamp'
+ 'source': Utility used to post the Tweet, as an HTML-formatted string.
+ 'text': The actual UTF-8 text of the status update, i.e. the text content of the tweet.
+ 'retweeted_status_id'
+ 'retweeted_status_user_id'
+ 'retweeted_status_timestamp'
+ 'expanded_urls'
+ 'rating_numerator': the numerator of the rating; usually greater than 10, though values below 10 also occur.
+ 'rating_denominator': the denominator of the rating; usually 10.
+ 'name': the dog's name, inferred from the tweet text.
+ 'doggo': one of the four dog stages.
+ 'floofer': one of the four dog stages.
+ 'pupper': one of the four dog stages.
+ 'puppo': one of the four dog stages.

Reference:
+ https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object

In [8]:
df_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), ob

In [9]:
df_raw.isnull().any()

tweet_id                      False
in_reply_to_status_id          True
in_reply_to_user_id            True
timestamp                     False
source                        False
text                          False
retweeted_status_id            True
retweeted_status_user_id       True
retweeted_status_timestamp     True
expanded_urls                  True
rating_numerator              False
rating_denominator            False
name                          False
doggo                         False
floofer                       False
pupper                        False
puppo                         False
dtype: bool

In [10]:
df_raw.duplicated(subset=['tweet_id','name']).sum()

0

In [11]:
df_raw['name'].value_counts().nlargest(20)

None       745
a           55
Charlie     12
Cooper      11
Lucy        11
Oliver      11
Penny       10
Tucker      10
Lola        10
Winston      9
Bo           9
Sadie        8
the          8
an           7
Toby         7
Bailey       7
Buddy        7
Daisy        7
Scout        6
Oscar        6
Name: name, dtype: int64

In [12]:
df_raw[['rating_numerator','rating_denominator']].describe()

| | rating_numerator | rating_denominator |
|---|---|---|
| count | 2356.0 | 2356.0 |
| mean | 13.126486 | 10.455433 |
| std | 45.876648 | 6.745237 |
| min | 0.0 | 0.0 |
| 25% | 10.0 | 10.0 |
| 50% | 11.0 | 10.0 |
| 75% | 12.0 | 10.0 |
| max | 1776.0 | 170.0 |


In [13]:
df_raw.rating_numerator.value_counts().nlargest(20)

12     558
11     464
10     461
13     351
9      158
8      102
7       55
14      54
5       37
6       32
3       19
4       17
1        9
2        9
420      2
0        2
15       2
75       2
80       1
20       1
Name: rating_numerator, dtype: int64

In [14]:
df_raw.rating_denominator.value_counts()

10     2333
11        3
50        3
80        2
20        2
2         1
16        1
40        1
70        1
15        1
90        1
110       1
120       1
130       1
150       1
170       1
7         1
0         1
Name: rating_denominator, dtype: int64

In [15]:
for i in df_raw.text.values[:5]:
    display(i) 

"This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU"

"This is Tilly. She's just checking pup on you. Hopes you're doing ok. If not, she's available for pats, snugs, boops, the whole bit. 13/10 https://t.co/0Xxu71qeIV"

'This is Archie. He is a rare Norwegian Pouncing Corgo. Lives in the tall grass. You never know when one may strike. 12/10 https://t.co/wUnZnhtVJB'

'This is Darla. She commenced a snooze mid meal. 13/10 happens to the best of us https://t.co/tD36da7qLQ'

'This is Franklin. He would like you to stop calling him "cute." He is a very fierce shark and should be respected as such. 12/10 #BarkWeek https://t.co/AtUZn91f7f'

In [16]:
df_raw.query("name == 'a'").text.values[:5]

array(['Here is a pupper approaching maximum borkdrive. Zooming at never before seen speeds. 14/10 paw-inspiring af \n(IG: puffie_the_chow) https://t.co/ghXBIIeQZF',
       'Here is a perfect example of someone who has their priorities in order. 13/10 for both owner and Forrest https://t.co/LRyMrU7Wfq',
       'Guys this is getting so out of hand. We only rate dogs. This is a Galapagos Speed Panda. Pls only send dogs... 10/10 https://t.co/8lpAGaZRFn',
       'This is a mighty rare blue-tailed hammer sherk. Human almost lost a limb trying to take these. Be careful guys. 8/10 https://t.co/TGenMeXreW',
       'Viewer discretion is advised. This is a terrible attack in progress. Not even in water (tragic af). 4/10 bad sherk https://t.co/L3U0j14N5R'],
      dtype=object)

In [17]:
df_raw[df_raw.rating_numerator==45].text.values

array(['From left to right:\nCletus, Jerome, Alejandro, Burp, &amp; Titson\nNone know where camera is. 45/50 would hug all at once https://t.co/sedre1ivTK'],
      dtype=object)

In [18]:
df_raw[df_raw.rating_numerator==1776].text.values

array(["This is Atticus. He's quite simply America af. 1776/10 https://t.co/GRXwMxLBkh"],
      dtype=object)

In [19]:
df_raw[df_raw.rating_numerator==0].text.values[:]

array(["When you're so blinded by your systematic plagiarism that you forget what day it is. 0/10 https://t.co/YbEJPkg4Ag",
       "PUPDATE: can't see any. Even if I could, I couldn't reach them to pet. 0/10 much disappointment https://t.co/c7WXaB2nqX"],
      dtype=object)

In [20]:
df_raw[df_raw.rating_numerator==420].text.values[:1]

array(['@dhmontgomery We also gave snoop dogg a 420/10 but I think that predated your research'],
      dtype=object)

In [21]:
df_raw[df_raw.rating_numerator==20].text.values[:1]

array(["I'm aware that I could've said 20/16, but here at WeRateDogs we are very professional. An inconsistent rating scale is simply irresponsible"],
      dtype=object)

In [22]:
df_raw[df_raw.rating_denominator==170].text.values

array(['Say hello to this unbelievably well behaved squad of doggos. 204/170 would try to pet all at once https://t.co/yGQI3He3xv'],
      dtype=object)

In [23]:
df_raw[(df_raw.in_reply_to_status_id.notnull())&(df_raw.in_reply_to_user_id.notnull())].shape

(78, 17)

In [24]:
df_raw[(df_raw.rating_numerator>20)&(df_raw.rating_denominator>10)].shape

(13, 17)

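#### Summary of the df_raw assessment:
+ 'timestamp' and 'retweeted_status_timestamp' have dtype object.
+ The maxima of 'rating_numerator' and 'rating_denominator' are 1776 and 170, far above their means of about 13.1 and 10.5.
+ 'rating_numerator' contains large values such as 45, whose 'rating_denominator' is 50; this is a combined rating for five dogs in one tweet.
+ 'in_reply_to_status_id', 'in_reply_to_user_id', 'retweeted_status_id', 'retweeted_status_user_id', 'retweeted_status_timestamp' and 'expanded_urls' contain nulls.
+ The dog stage is spread across four columns.
+ Some dogs have no stage; the value is the string 'None', not null.
+ Some dogs have no name; the value is the string 'None', not null.
+ Some dog names are 'an', 'the' or 'a'.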
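Because the placeholders are ordinary strings, `isnull()` reports no missing names even though 745 rows hold 'None'. The sketch below shows the effect on a toy frame and one possible way to turn the placeholders into real missing values; the toy data and the use of `Series.mask` are assumptions for illustration, not the cleaning step used later in this notebook.

```python
import pandas as pd

# Toy stand-in for df_raw with placeholder strings instead of nulls
toy = pd.DataFrame({
    'name':  ['Charlie', 'None', 'a', 'the'],
    'doggo': ['None', 'doggo', 'None', 'None'],
})

# The placeholders are plain strings, so isnull() finds nothing
missing_before = int(toy['name'].isnull().sum())

# Mark the placeholder and the article words as missing (NaN)
placeholders = ['None', 'a', 'an', 'the']
name_clean = toy['name'].mask(toy['name'].isin(placeholders))
missing_after = int(name_clean.isnull().sum())

print(missing_before, missing_after)
```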

### Assessing the image-predictions.tsv file

In [25]:
df_img.sample(5)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
1421,772102971039580160,https://pbs.twimg.com/media/CrcPjh0WcAA_SPT.jpg,1,Pembroke,0.54178,True,Cardigan,0.260504,True,Shetland_sheepdog,0.063703,True
1087,718939241951195136,https://pbs.twimg.com/media/CfovbK4WIAAkTn3.jpg,1,Pembroke,0.766327,True,Cardigan,0.222126,True,toilet_tissue,0.006757,False
1719,819227688460238848,https://pbs.twimg.com/media/C157Oq3WQAEOyHm.jpg,1,Border_terrier,0.482452,True,German_shepherd,0.181082,True,Norwegian_elkhound,0.065257,True
811,692417313023332352,https://pbs.twimg.com/media/CZv13u5WYAA6wQe.jpg,1,bison,0.208922,False,mink,0.169945,False,polecat,0.144494,False
1129,728015554473250816,https://pbs.twimg.com/media/ChpuRyvVAAARMoq.jpg,1,cocker_spaniel,0.384559,True,golden_retriever,0.091661,True,sandbar,0.081799,False


In [26]:
df_img.shape

(2075, 12)

In [27]:
df_img.columns

Index(['tweet_id', 'jpg_url', 'img_num', 'p1', 'p1_conf', 'p1_dog', 'p2',
       'p2_conf', 'p2_dog', 'p3', 'p3_conf', 'p3_dog'],
      dtype='object')

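#### 12 features and 2075 records in total.

+ tweet_id: the unique identifier of each tweet.
+ jpg_url: link to the image that was classified.
+ img_num: the number of the image that produced the most confident prediction.
+ p1: the algorithm's #1 prediction for the image in the tweet.
+ p1_conf: the confidence of the #1 prediction.
+ p1_dog: whether the #1 prediction is a dog (it may be another species, such as a bear or a horse); True means the prediction is a dog.
+ p2: the algorithm's second most likely prediction for the image.
+ p2_conf: the confidence of the #2 prediction.
+ p2_dog: whether the #2 prediction is a dog.
+ p3: the third most likely prediction for the image.
+ p3_conf: the confidence of the #3 prediction.
+ p3_dog: whether the #3 prediction is a dog.

Reference:
+ Udacity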

In [28]:
df_img.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
tweet_id    2075 non-null int64
jpg_url     2075 non-null object
img_num     2075 non-null int64
p1          2075 non-null object
p1_conf     2075 non-null float64
p1_dog      2075 non-null bool
p2          2075 non-null object
p2_conf     2075 non-null float64
p2_dog      2075 non-null bool
p3          2075 non-null object
p3_conf     2075 non-null float64
p3_dog      2075 non-null bool
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


In [29]:
df_img.describe()

| | tweet_id | img_num | p1_conf | p2_conf | p3_conf |
|---|---|---|---|---|---|
| count | 2075.0 | 2075.0 | 2075.0 | 2075.0 | 2075.0 |
| mean | 7.384514e+17 | 1.203855 | 0.594548 | 0.1345886 | 0.06032417 |
| std | 6.785203e+16 | 0.561875 | 0.271174 | 0.1006657 | 0.05090593 |
| min | 6.660209e+17 | 1.0 | 0.044333 | 1.0113e-08 | 1.74017e-10 |
| 25% | 6.764835e+17 | 1.0 | 0.364412 | 0.05388625 | 0.0162224 |
| 50% | 7.119988e+17 | 1.0 | 0.58823 | 0.118181 | 0.0494438 |
| 75% | 7.932034e+17 | 1.0 | 0.843855 | 0.1955655 | 0.09180755 |
| max | 8.924206e+17 | 4.0 | 1.0 | 0.488014 | 0.273419 |


In [30]:
df_img.isna().any()

tweet_id    False
jpg_url     False
img_num     False
p1          False
p1_conf     False
p1_dog      False
p2          False
p2_conf     False
p2_dog      False
p3          False
p3_conf     False
p3_dog      False
dtype: bool

In [31]:
df_img.duplicated(subset=['tweet_id'],keep="first").sum()

0

In [32]:
df_img[(df_img.p1_conf>=df_img.p2_conf)&(df_img.p2_conf>=df_img.p3_conf)|(df_img.p1_conf>=df_img.p3_conf)].shape

(2075, 12)
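One thing to keep in mind when reading the check above: in Python, `&` binds more tightly than `|`, so a condition of the form `A & B | C` is evaluated as `(A & B) | C`. The toy frame below (invented values) contrasts that reading with a stricter test of the full ordering p1_conf >= p2_conf >= p3_conf.

```python
import pandas as pd

# Invented confidences; only the first row is fully ordered
toy = pd.DataFrame({
    'p1_conf': [0.6, 0.5, 0.5],
    'p2_conf': [0.3, 0.1, 0.6],
    'p3_conf': [0.1, 0.2, 0.1],
})

a = toy.p1_conf >= toy.p2_conf
b = toy.p2_conf >= toy.p3_conf
c = toy.p1_conf >= toy.p3_conf

# Evaluated as (a & b) | c: any row with p1_conf >= p3_conf passes
loose = a & b | c
# Requires the complete ordering p1_conf >= p2_conf >= p3_conf
strict = a & b

print(loose.tolist(), strict.tolist())
```

Here the loose form accepts all three rows, while the strict form accepts only the first.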

In [33]:
df_img.img_num.value_counts()

1    1780
2     198
3      66
4      31
Name: img_num, dtype: int64

In [34]:
len(df_img.query("img_num >1"))

295

In [35]:
df_img.p1.str.istitle().sum()

402

In [36]:
df_img.p1.str.islower().sum()

1135

In [37]:
df_img[df_img.p1.str.istitle()].head(1)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
15,666099513787052032,https://pbs.twimg.com/media/CT51-JJUEAA6hV8.jpg,1,Lhasa,0.58233,True,Shih-Tzu,0.166192,True,Dandie_Dinmont,0.089688,True


In [38]:
df_img[~((df_img.p1.str.istitle())|(df_img.p1.str.islower()))].head(1)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True


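#### Summary of the df_img assessment:
+ The columns p2, p2_conf, p3, p3_conf, p2_dog and p3_dog are not needed.
+ img_num is the number of the corresponding image; this column is not needed.
+ p1, the dog type, has inconsistent capitalization.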



### Assessing the tweet_json.txt file

In [39]:
df_supl.sample(5)

Unnamed: 0,contributors,coordinates,created_at,display_text_range,entities,extended_entities,favorite_count,favorited,full_text,geo,id,id_str,in_reply_to_screen_name,in_reply_to_status_id,in_reply_to_status_id_str,in_reply_to_user_id,in_reply_to_user_id_str,is_quote_status,lang,place,possibly_sensitive,possibly_sensitive_appealable,quoted_status,quoted_status_id,quoted_status_id_str,retweet_count,retweeted,retweeted_status,source,truncated,user
144,,,2017-05-13 19:11:30,"[0, 140]","{'hashtags': [], 'symbols': [], 'user_mentions': [{'screen_name': 'dog_rates', 'name': 'SpookyWeRateDogs™', 'id': 4196983835, 'id_str': '419698383...",,0,False,RT @dog_rates: Say hello to Quinn. She's quite the goofball. Not even a year old. Confirmed 13/10 but she really needs your help \n\nhttps://…,,863471782782697472,863471782782697472,,,,,,False,en,,,,,,,2664,False,"{'created_at': 'Fri May 12 16:05:02 +0000 2017', 'id': 863062471531167744, 'id_str': '863062471531167744', 'full_text': 'Say hello to Quinn. She's...","<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",False,"{'id': 4196983835, 'id_str': '4196983835', 'name': 'SpookyWeRateDogs™', 'screen_name': 'dog_rates', 'location': 'MERCH↴ DM DOGS. WE WILL RATE', '..."
527,,,2016-12-12 02:21:26,"[0, 95]","{'hashtags': [], 'symbols': [], 'user_mentions': [{'screen_name': 'dog_rates', 'name': 'SpookyWeRateDogs™', 'id': 4196983835, 'id_str': '419698383...","{'media': [{'id': 801167887901724672, 'id_str': '801167887901724672', 'indices': [72, 95], 'media_url': 'http://pbs.twimg.com/media/Cx5R8wPVEAALa9...",0,False,RT @dog_rates: This is Milo. I would do terrible things for Milo. 13/10 https://t.co/R6wJyC2Tey,,808134635716833280,808134635716833280,,,,,,False,en,,0.0,0.0,,,,6925,False,"{'created_at': 'Tue Nov 22 20:58:07 +0000 2016', 'id': 801167903437357056, 'id_str': '801167903437357056', 'full_text': 'This is Milo. I would do ...","<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",False,"{'id': 4196983835, 'id_str': '4196983835', 'name': 'SpookyWeRateDogs™', 'screen_name': 'dog_rates', 'location': 'MERCH↴ DM DOGS. WE WILL RATE', '..."
1409,,,2016-02-14 16:33:40,"[0, 83]","{'hashtags': [], 'symbols': [], 'user_mentions': [], 'urls': [], 'media': [{'id': 698907962606292992, 'id_str': '698907962606292992', 'indices': [...","{'media': [{'id': 698907962606292992, 'id_str': '698907962606292992', 'indices': [60, 83], 'media_url': 'http://pbs.twimg.com/media/CbMFFsnXEAA1mk...",2778,False,This dog is never sure if he's doing the right thing. 10/10 https://t.co/GXq43zFfBu,,698907974262222848,698907974262222848,,,,,,False,en,,0.0,0.0,,,,694,False,,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",False,"{'id': 4196983835, 'id_str': '4196983835', 'name': 'SpookyWeRateDogs™', 'screen_name': 'dog_rates', 'location': 'MERCH↴ DM DOGS. WE WILL RATE', '..."
1737,,,2015-12-23 01:37:45,"[0, 73]","{'hashtags': [], 'symbols': [], 'user_mentions': [], 'urls': [], 'media': [{'id': 679475944860459008, 'id_str': '679475944860459008', 'indices': [...","{'media': [{'id': 679475944860459008, 'id_str': '679475944860459008', 'indices': [50, 73], 'media_url': 'http://pbs.twimg.com/media/CW37xZbUoAAUXe...",2289,False,This is Evy. She doesn't want to be a Koala. 9/10 https://t.co/VITeF0Kl9L,,679475951516934144,679475951516934144,,,,,,False,en,,0.0,0.0,,,,722,False,,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",False,"{'id': 4196983835, 'id_str': '4196983835', 'name': 'SpookyWeRateDogs™', 'screen_name': 'dog_rates', 'location': 'MERCH↴ DM DOGS. WE WILL RATE', '..."
1464,,,2016-02-01 17:11:59,"[0, 137]","{'hashtags': [], 'symbols': [], 'user_mentions': [], 'urls': [], 'media': [{'id': 694206563263848448, 'id_str': '694206563263848448', 'indices': [...","{'media': [{'id': 694206563263848448, 'id_str': '694206563263848448', 'indices': [114, 137], 'media_url': 'http://pbs.twimg.com/media/CaJRMPQWIAA1...",4571,False,"""Martha come take a look at this. I'm so fed up with the media's unrealistic portrayal of dogs these days."" 10/10 https://t.co/Sd4qAdSRqI",,694206574471057408,694206574471057408,,,,,,False,en,,0.0,0.0,,,,2286,False,,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",False,"{'id': 4196983835, 'id_str': '4196983835', 'name': 'SpookyWeRateDogs™', 'screen_name': 'dog_rates', 'location': 'MERCH↴ DM DOGS. WE WILL RATE', '..."


In [40]:
df_supl.shape

(2352, 31)

In [41]:
df_supl.columns

Index(['contributors', 'coordinates', 'created_at', 'display_text_range',
       'entities', 'extended_entities', 'favorite_count', 'favorited',
       'full_text', 'geo', 'id', 'id_str', 'in_reply_to_screen_name',
       'in_reply_to_status_id', 'in_reply_to_status_id_str',
       'in_reply_to_user_id', 'in_reply_to_user_id_str', 'is_quote_status',
       'lang', 'place', 'possibly_sensitive', 'possibly_sensitive_appealable',
       'quoted_status', 'quoted_status_id', 'quoted_status_id_str',
       'retweet_count', 'retweeted', 'retweeted_status', 'source', 'truncated',
       'user'],
      dtype='object')

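#### 31 features and 2352 records in total.

+ favorite_count: the number of times the tweet was liked, which reflects whether readers liked its content.
+ retweet_count: the number of times the tweet was retweeted, which reflects how much readers liked its content.
+ id: the unique identifier of the tweet.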


In [42]:
df_supl.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2352 entries, 0 to 2351
Data columns (total 31 columns):
contributors                     0 non-null float64
coordinates                      0 non-null float64
created_at                       2352 non-null datetime64[ns]
display_text_range               2352 non-null object
entities                         2352 non-null object
extended_entities                2073 non-null object
favorite_count                   2352 non-null int64
favorited                        2352 non-null bool
full_text                        2352 non-null object
geo                              0 non-null float64
id                               2352 non-null int64
id_str                           2352 non-null int64
in_reply_to_screen_name          78 non-null object
in_reply_to_status_id            78 non-null float64
in_reply_to_status_id_str        78 non-null float64
in_reply_to_user_id              78 non-null float64
in_reply_to_user_id_str          78 n

In [43]:
df_supl[['favorite_count','retweet_count']].describe()

| | favorite_count | retweet_count |
|---|---|---|
| count | 2352.0 | 2352.0 |
| mean | 8109.19898 | 3134.932398 |
| std | 11980.795669 | 5237.846296 |
| min | 0.0 | 0.0 |
| 25% | 1417.0 | 618.0 |
| 50% | 3596.5 | 1456.5 |
| 75% | 10118.0 | 3628.75 |
| max | 132318.0 | 79116.0 |


In [51]:
dict(df_supl.entities.iloc[:1])

{0: {'hashtags': [],
  'symbols': [],
  'user_mentions': [],
  'urls': [],
  'media': [{'id': 892420639486877696,
    'id_str': '892420639486877696',
    'indices': [86, 109],
    'media_url': 'http://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg',
    'media_url_https': 'https://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg',
    'url': 'https://t.co/MgUWQ76dJU',
    'display_url': 'pic.twitter.com/MgUWQ76dJU',
    'expanded_url': 'https://twitter.com/dog_rates/status/892420643555336193/photo/1',
    'type': 'photo',
    'sizes': {'large': {'w': 540, 'h': 528, 'resize': 'fit'},
     'thumb': {'w': 150, 'h': 150, 'resize': 'crop'},
     'small': {'w': 540, 'h': 528, 'resize': 'fit'},
     'medium': {'w': 540, 'h': 528, 'resize': 'fit'}}}]}}

In [52]:
dict(df_supl.extended_entities.iloc[:1])

{0: {'media': [{'id': 892420639486877696,
    'id_str': '892420639486877696',
    'indices': [86, 109],
    'media_url': 'http://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg',
    'media_url_https': 'https://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg',
    'url': 'https://t.co/MgUWQ76dJU',
    'display_url': 'pic.twitter.com/MgUWQ76dJU',
    'expanded_url': 'https://twitter.com/dog_rates/status/892420643555336193/photo/1',
    'type': 'photo',
    'sizes': {'large': {'w': 540, 'h': 528, 'resize': 'fit'},
     'thumb': {'w': 150, 'h': 150, 'resize': 'crop'},
     'small': {'w': 540, 'h': 528, 'resize': 'fit'},
     'medium': {'w': 540, 'h': 528, 'resize': 'fit'}}}]}}

In [56]:
dict(df_supl.user.sample(1))

{351: {'id': 4196983835,
  'id_str': '4196983835',
  'name': 'SpookyWeRateDogs™',
  'screen_name': 'dog_rates',
  'location': 'MERCH↴  DM DOGS. WE WILL RATE',
  'description': 'Only Legit Source for Professional Dog Ratings STORE: @ShopWeRateDogs | IG, FB & SC: WeRateDogs | MOBILE APP: @GoodDogsGame Business: dogratingtwitter@gmail.com',
  'url': 'https://t.co/N7sNNHAEXS',
  'entities': {'url': {'urls': [{'url': 'https://t.co/N7sNNHAEXS',
      'expanded_url': 'http://weratedogs.com',
      'display_url': 'weratedogs.com',
      'indices': [0, 23]}]},
   'description': {'urls': []}},
  'protected': False,
  'followers_count': 3768844,
  'friends_count': 107,
  'listed_count': 3195,
  'created_at': 'Sun Nov 15 21:41:29 +0000 2015',
  'favourites_count': 120161,
  'utc_offset': None,
  'time_zone': None,
  'geo_enabled': True,
  'verified': True,
  'statuses_count': 5749,
  'lang': 'en',
  'contributors_enabled': False,
  'is_translator': False,
  'is_translation_enabled': False,
  'prof

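#### Summary of the df_supl assessment:
+ Keep the three columns favorite_count, retweet_count and id; the other columns are not needed.
+ This dataset holds per-tweet like and retweet counts and, like df_raw, stores the dog as its observation.
+ id has dtype int64.
+ Consider extracting the id from the entities column and validating it against the existing id column.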
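The last point could be approached roughly as follows. In the entities record shown earlier, the media `id` differs from the tweet id, but the `expanded_url` contains a `/status/<id>/` segment. The helper `status_id_from_entities` is a hypothetical name introduced here, and the assumption that the URL always carries the tweet's own id is untested; for retweets it may point at the original tweet instead.

```python
import re

def status_id_from_entities(entities):
    """Return the status id from the first media expanded_url that contains one, else None."""
    for media in (entities or {}).get('media', []):
        match = re.search(r'/status/(\d+)', media.get('expanded_url', ''))
        if match:
            return int(match.group(1))
    return None

# Trimmed copy of the entities record printed above
entities = {
    'media': [{
        'id': 892420639486877696,
        'expanded_url': 'https://twitter.com/dog_rates/status/892420643555336193/photo/1',
    }]
}

extracted = status_id_from_entities(entities)
print(extracted)
```

Applied with `df_supl.entities.apply(status_id_from_entities)`, the result could then be compared with `df_supl.id` row by row; records without media would yield None.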



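### Assessment summary:

#### Quality

#####  **twitter-archive-enhanced data**
+ The columns 'in_reply_to_status_id', 'in_reply_to_user_id', 'retweeted_status_id', 'retweeted_status_user_id', 'retweeted_status_timestamp', 'expanded_urls' and 'source' are not needed.
+ Some dogs have no name; the value is 'None', 'an', 'the' or 'a' rather than null.
+ Some dogs have no stage; the value is 'None', not null.
+ The maxima of 'rating_numerator' and 'rating_denominator' are 1776 and 170, far above their means of about 13.1 and 10.5.
+ timestamp has dtype object and tweet_id has dtype int64.

##### **image-prediction data**
+ p1, the dog type, has inconsistent capitalization.
+ The columns p2, p2_conf, p3, p3_conf, p2_dog, p3_dog and img_num are not needed.

##### **tweet_json data**
+ Keep the three feature columns favorite_count, retweet_count and id; the other columns are not needed.


#### Tidiness
+ df_raw, df_img and df_supl all have the dog as their observation.
+ In df_raw the dog stage is spread across four columns.
+ df_img uses two columns, 'p1' and 'p1_dog', to describe the dog type.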


## Clean

### Cleaning data quality issues

#### Back up the three datasets

In [None]:
df_raw_clean = df_raw.copy()
df_img_clean = df_img.copy()
df_supl_clean = df_supl.copy()

#### Remove unneeded features

##### Define
+ Drop the following unneeded columns from df_raw: 'in_reply_to_status_id', 'in_reply_to_user_id', 'retweeted_status_id', 'retweeted_status_user_id', 'retweeted_status_timestamp', 'expanded_urls', 'source'

+ Drop the columns p2, p2_conf, p3, p3_conf, p2_dog, p3_dog and img_num from df_img

+ Drop every column of df_supl except these three: 'favorite_count', 'retweet_count', 'id'.

##### Code

In [None]:
# Drop the reply/retweet metadata and other columns not used in the analysis
df_raw_clean.drop(columns=['in_reply_to_status_id','in_reply_to_user_id','retweeted_status_id','retweeted_status_user_id',
                            'retweeted_status_timestamp','expanded_urls','source'],inplace=True)
df_img_clean.drop(columns=['p2','p2_conf','p3','p3_conf','p2_dog','p3_dog','img_num'],inplace=True)
# .copy() makes the selection an independent DataFrame, so later in-place edits do not trigger SettingWithCopyWarning
df_supl_clean = df_supl_clean[['id','favorite_count','retweet_count']].copy()

##### Test

In [None]:
df_raw_clean.columns

In [None]:
df_img_clean.columns

In [None]:
df_supl_clean.columns

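#### Clean null values

##### Define
+ Use replace to change the values 'a', 'an' and 'the' in the 'name' column of df_raw to "None"
+ Use dropna to remove null values together with their records.

##### Code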

In [None]:
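# Assign the result back to the column rather than calling replace(inplace=True) on an attribute accessor
df_raw_clean['name'] = df_raw_clean['name'].replace(['a','an','the'],"None")
df_raw_clean.dropna(axis='index',inplace=True)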

##### Test

In [None]:
df_raw_clean.name.isnull().sum()

#### Fix data types

##### Define
+ Use astype or to_datetime to convert 'timestamp' in df_raw to a datetime type
+ Use astype to convert 'tweet_id' in df_raw and df_img, and 'id' in df_supl, to str

##### Code

In [None]:
df_raw_clean.timestamp = pd.to_datetime(df_raw_clean.timestamp)
df_raw_clean.tweet_id = df_raw_clean.tweet_id.astype(str)
df_img_clean.tweet_id = df_img_clean.tweet_id.astype(str)
df_supl_clean.id = df_supl_clean.id.astype(str)

##### Test

In [None]:
df_raw_clean.dtypes

In [None]:
df_img_clean.dtypes

In [None]:
df_supl_clean.dtypes
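One reason to store the ids as text: they are labels rather than quantities, and an 18-digit id does not always survive a trip through float64, which is the type pandas falls back to for integer columns with missing values (as with retweeted_status_id, displayed above as 7.735476e+17). A small illustration in plain Python, using a status id that appears in the entities output earlier:

```python
tweet_id = 892420643555336193   # status id seen in the expanded_url shown earlier

# float64 carries 53 bits of precision, so not every 18-digit integer is representable
as_float = float(tweet_id)
round_trip = int(as_float)

# A string keeps every digit
as_text = str(tweet_id)

print(round_trip == tweet_id, int(as_text) == tweet_id)
```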

#### Fix consistency issues

##### Define
+ Use str.title() to standardize the format of the dog types in the 'p1' column of df_img.

##### Code

In [None]:
df_img_clean.p1=df_img_clean.p1.str.title()

##### Test

In [None]:
df_img_clean.p1.str.istitle().all()
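A note on testing this step: comparing the length of the `istitle()` result with the number of rows is True regardless of casing, whereas `.all()` is True only when every label is title-cased. The labels below are a small sample in the style of the p1 column, chosen for illustration.

```python
import pandas as pd

p1 = pd.Series(['Welsh_springer_spaniel', 'collie', 'Lhasa', 'black-and-tan_coonhound'])
p1_title = p1.str.title()

# Length comparison: always True, so it cannot detect a casing problem
length_check = p1.str.istitle().shape[0] == p1.shape[0]

# .all() is True only if every label passes istitle()
before = bool(p1.str.istitle().all())
after = bool(p1_title.str.istitle().all())

print(p1_title.tolist())
print(length_check, before, after)
```

`str.title()` capitalizes the first letter after every non-letter character, so words joined by underscores or hyphens are each capitalized.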

#### Handle outliers

##### Define

+ Filter out the df_raw records whose 'rating_numerator' is greater than 20 or whose 'rating_denominator' is greater than 10.

##### Code

In [None]:
df_raw_clean = df_raw_clean[~((df_raw_clean.rating_numerator>20)|(df_raw_clean.rating_denominator>10))]

##### Test

In [None]:
df_raw_clean.describe()

In [None]:
df_raw_clean[df_raw_clean.rating_denominator==0]
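The choice between `|` and `&` in the filter changes which rows survive, and neither form catches denominators below 10. The toy ratings below are loosely modelled on values seen in the assessment and are for illustration only.

```python
import pandas as pd

toy = pd.DataFrame({
    'rating_numerator':   [12, 1776, 45,  9, 10],
    'rating_denominator': [10,   10, 50, 11,  7],
})

high_num = toy.rating_numerator > 20
high_den = toy.rating_denominator > 10

# Drop a row if EITHER condition holds (the form used in the cleaning step)
kept_or = toy[~(high_num | high_den)]
# Drop a row only if BOTH conditions hold
kept_and = toy[~(high_num & high_den)]

print(kept_or.index.tolist(), kept_and.index.tolist())
```

The last row, with a denominator of 7, passes both filters, so a check such as `rating_denominator != 10` would be needed if only standard ratings should remain.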

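### Cleaning tidiness issues

#### Merge multiple columns that represent one feature

##### Define
+ Filter out the df_img records where 'p1_dog' is False
+ Create df_temp, a slice of df_raw containing the four columns 'doggo', 'floofer', 'pupper' and 'puppo', and add a 'stage' column
+ Use apply to combine the four columns 'doggo', 'floofer', 'pupper' and 'puppo' of df_raw into 'stage'
+ Use drop to remove the four columns 'doggo', 'floofer', 'pupper' and 'puppo' from df_raw, and merge 'stage' from df_temp into df_raw.

##### Code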

In [None]:
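# Keep only rows whose top prediction is a dog; .copy() avoids SettingWithCopyWarning on later in-place edits
df_img_clean = df_img_clean[df_img_clean.p1_dog].copy()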

In [None]:
# Work on an independent copy of the four stage columns
df_temp = df_raw_clean.loc[:, ['doggo','floofer','pupper','puppo']].copy()
stage_list = df_temp.columns.tolist()

def to_stage(row):
    # Return the first stage whose column holds its own name; "None" if no stage is set
    for c in stage_list:
        if row[c] == c:
            return c
    return "None"

df_temp["stage"] = df_temp.apply(to_stage, axis=1)
df_temp.drop(stage_list, axis=1, inplace=True)
# Attach 'stage' to the main table, then drop the four original columns
df_raw_clean = pd.concat([df_raw_clean, df_temp], axis=1)
df_raw_clean.drop(stage_list, axis=1, inplace=True)

##### Test

In [None]:
df_img_clean.p1_dog.value_counts()

In [None]:
df_raw_clean.columns

In [None]:
df_raw_clean.sample(5)
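The merge above keeps the first matching stage in the order doggo, floofer, pupper, puppo. If any tweet sets more than one stage column, the later ones are silently dropped. The sketch below contrasts that behaviour with a variant that keeps every stage; the toy rows and the helper names `first_stage` and `all_stages` are illustrative assumptions.

```python
import pandas as pd

stage_cols = ['doggo', 'floofer', 'pupper', 'puppo']
toy = pd.DataFrame({
    'doggo':   ['doggo', 'None', 'doggo', 'None'],
    'floofer': ['None', 'None', 'None', 'None'],
    'pupper':  ['None', 'pupper', 'pupper', 'None'],
    'puppo':   ['None', 'None', 'None', 'None'],
})

def first_stage(row):
    # Same rule as the cleaning step: the first column that holds its own name wins
    for c in stage_cols:
        if row[c] == c:
            return c
    return 'None'

def all_stages(row):
    # Keep every stage that is set, joined by commas
    found = [c for c in stage_cols if row[c] == c]
    return ','.join(found) if found else 'None'

toy['stage_first'] = toy[stage_cols].apply(first_stage, axis=1)
toy['stage_all'] = toy[stage_cols].apply(all_stages, axis=1)
print(toy[['stage_first', 'stage_all']])
```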

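#### Merge multiple tables that describe the same observation

##### Define
+ Use drop to remove the unneeded columns 'p1_dog' and 'p1_conf' from df_img
+ Use rename to change 'id' in df_supl to 'tweet_id'
+ Use merge to left-join the three datasets df_raw, df_supl and df_img on 'tweet_id'.
+ Use rename to change 'p1' (from df_img) to 'dog_type'

##### Code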

In [None]:
df_img_clean.drop(columns=['p1_dog','p1_conf'],inplace=True)
df_supl_clean.rename(columns={'id':'tweet_id'},inplace=True)

In [None]:
df_archive_master = pd.merge(df_raw_clean,df_supl_clean,on='tweet_id',how='left')
df_archive_master = pd.merge(df_archive_master,df_img_clean,on='tweet_id',how='left')

In [None]:
df_archive_master.rename(columns={'p1':'dog_type'},inplace=True)

##### Test

In [None]:
df_archive_master.head(5)

In [None]:
df_archive_master.tail(5)

In [None]:
df_archive_master.info()
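With a left join, archive rows that have no match in the other tables are kept and filled with NaN. One way to see how many rows were left unmatched is the `indicator` option of `pd.merge`. The three toy frames below are invented stand-ins for df_raw_clean, df_supl_clean and df_img_clean.

```python
import pandas as pd

archive = pd.DataFrame({'tweet_id': ['1', '2', '3'], 'rating_numerator': [12, 13, 11]})
counts = pd.DataFrame({'tweet_id': ['1', '2'], 'favorite_count': [100, 250]})
images = pd.DataFrame({'tweet_id': ['1', '3'], 'p1': ['Saluki', 'Collie']})

# Left joins keep every archive row, matched or not
master = pd.merge(archive, counts, on='tweet_id', how='left')
# indicator=True adds a '_merge' column that says where each row was found
master = pd.merge(master, images, on='tweet_id', how='left', indicator=True)

# Rows with no image prediction
unmatched = master[master['_merge'] == 'left_only']
print(len(master), unmatched['tweet_id'].tolist())
```

The key columns must share a dtype on both sides, which is why all three id columns were converted to str before merging.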

### Store the cleaned master dataset

In [None]:
# Save the cleaned master dataset as a csv file
df_archive_master.to_csv('twitter_archive_master.csv',encoding='utf-8')

## Analysis and visualization

#### Load the stored dataset

In [None]:
df_visual= pd.read_csv('twitter_archive_master.csv',header=0,index_col =0)
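A CSV file does not store dtypes, so on reload the ids are inferred as integers again, and depending on the pandas version the default list of NA markers may also turn the literal string 'None' into NaN. The sketch below shows one way to control both; the CSV text is a made-up miniature of the master file, and the parameter choices are a suggestion rather than what this notebook does.

```python
import io
import pandas as pd

# Made-up miniature of twitter_archive_master.csv
csv_text = (
    'tweet_id,stage,dog_type\n'
    '892420643555336193,None,Saluki\n'
    '892177421306343426,puppo,\n'
)

df_demo = pd.read_csv(
    io.StringIO(csv_text),
    dtype={'tweet_id': str},   # keep ids as text
    keep_default_na=False,     # do not apply the built-in list of NA strings
    na_values=[''],            # but still treat empty fields as missing
)
print(df_demo)
```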

In [None]:
plt.figure(figsize=(10,7))
dog_type_rating_mean = df_visual.groupby('dog_type')['rating_numerator'].mean().sort_values(ascending=False)
dog_type_rating_mean.iloc[:10].plot(kind ="barh",alpha=0.7)
plt.xlabel('mean of rating')
plt.ylabel("type of dog")
plt.title("Top 10 Dog Types by Mean Rating")
plt.show()

In [None]:
dog_type_rating_mean.iloc[:10]

>Observations:
+ "Saluki" dogs have the highest mean rating, 12.50.
+ Next are "Tibetan_Mastiff" and 'Briard', with 12.40 and 12.33 respectively.
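A ranking by group mean can be dominated by dog types that appear only a few times. Reporting the group size next to the mean makes this visible. The toy data below is invented; the pattern is a sketch of how the same groupby could be extended, not a result from the real dataset.

```python
import pandas as pd

toy = pd.DataFrame({
    'dog_type': ['Saluki', 'Saluki', 'Pug', 'Pug', 'Pug', 'Collie'],
    'rating_numerator': [13, 12, 10, 11, 9, 12],
})

# Mean rating and number of tweets per dog type, highest mean first
summary = (
    toy.groupby('dog_type')['rating_numerator']
       .agg(['mean', 'count'])
       .sort_values('mean', ascending=False)
)
top = summary.head(10)
print(top)
```

A minimum-count filter such as `summary[summary['count'] >= 5]` (the threshold is arbitrary) could be applied before taking the top 10.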

In [None]:
plt.figure(figsize=(10,7))
stage_rating_mean = df_visual[df_visual.stage!='None'].groupby('stage')['rating_numerator'].mean().sort_values(ascending=False)
stage_rating_mean.plot(kind ="bar",alpha=0.7)
plt.ylabel('mean of rating')
plt.xlabel("stage of dog")
plt.title("Mean Rating by Dog Stage")
plt.show()

In [None]:
stage_rating_mean

>Observations:
+ Dogs at the 'puppo' stage have the highest mean rating, 12.10.
+ Dogs at the 'pupper' stage have the lowest mean rating, 10.78.

In [None]:
plt.figure(figsize=(10,7))
dog_type_favorite_mean = df_visual[df_visual.dog_type!='None'].groupby('dog_type')['favorite_count'].mean().sort_values(ascending=False)
dog_type_favorite_mean.iloc[:10].plot(kind ="bar",alpha=0.7)
plt.xlabel('type of dog')
plt.ylabel("favorite counts")
plt.title("Top 10 Dog Types by Mean Favorite Count")
plt.show()

In [None]:
dog_type_favorite_mean.iloc[:10]

>Observations:
+ "Saluki" dogs have the highest mean number of likes, 23996.
+ Next are "French_Bulldog" and 'Black-And-Tan_Coonhound', with 17854 and 16952 likes respectively.

In [None]:
plt.figure(figsize=(10,7))
dog_type_retweet_mean = df_visual[df_visual.dog_type!='None'].groupby('dog_type')['retweet_count'].mean().sort_values(ascending=False)
dog_type_retweet_mean.iloc[:10].plot(kind ="bar",alpha=0.7)
plt.xlabel("type of dog")
plt.ylabel('retweet counts')
plt.title("Top 10 Dog Types by Mean Retweet Count")
plt.show()

In [None]:
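dog_type_retweet_mean.iloc[:10]

>Observations:
+ 'Standard_Poodle' dogs have the highest mean number of retweets, 11033.
+ Next are "Afghan_Hound" and 'English_Springer', with 7127 and 5568 retweets respectively.
+ Although 'Saluki' dogs have the highest mean number of likes, they rank only fifth by retweets.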

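### Summary and outlook
+ This report focuses on the dog ratings in the dataset, together with related features such as likes and retweets. It uses analysis and visualization to identify which dog types and dog stages in the @WeRateDogs account are rated highest, and which dog types receive the most likes and retweets. The visual analysis shows:
    1. "Saluki" dogs have the highest mean rating, 12.50, followed by "Tibetan_Mastiff" and 'Briard' with 12.40 and 12.33 respectively.
    2. Dogs at the 'puppo' stage have the highest mean rating, 12.10, and dogs at the 'pupper' stage have the lowest, 10.78.
    3. "Saluki" dogs have the highest mean number of likes, 23996, followed by "French_Bulldog" and 'Black-And-Tan_Coonhound' with 17854 and 16952 respectively.
    4. 'Standard_Poodle' dogs have the highest mean number of retweets, 11033, followed by "Afghan_Hound" and 'English_Springer' with 7127 and 5568 respectively. Although 'Saluki' dogs have the highest mean number of likes, they rank only fifth by retweets.
    
+ These conclusions have limitations. First, the dataset is small, with only a little over two thousand records. Second, the data were collected mainly from 2015 to 2017 and may not reflect the current situation. Third, the features are limited: there is no information about the accounts that liked or retweeted, the tweet text is highly formulaic, no reply data were collected, and no sentiment analysis was performed on such text. Addressing these points would require collecting more data and applying suitable models and tools.
+ A further limitation lies in the focus of this report: it does not examine features of the Twitter account itself, such as features related to the like and retweet counts or to the ratings. These remain for follow-up work.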