In [1]:
import pandas as pd

In [47]:
data = pd.read_csv('Collected Data/scraped.csv')
data.head()

Unnamed: 0,ID,Sales Rank,Date,Review Score
0,545790352,118,"October 6, 2015",4.77
1,1419717014,399,"November 3, 2015",4.8
2,1423160916,9637,"October 6, 2015",4.6
3,1476789886,5439,"October 27, 2015",4.9
4,1338029991,196,"November 10, 2015",4.61


Drop the unreachable webpages:

In [48]:
data[data['Sales Rank'] == 'Error 404']

Unnamed: 0,ID,Sales Rank,Date,Review Score
237,1507745923,Error 404,Error 404,Error 404
510,1423160657,Error 404,Error 404,Error 404
2566,151206212X,Error 404,Error 404,Error 404
3738,0375848134,Error 404,Error 404,Error 404
4119,1846432065,Error 404,Error 404,Error 404
4590,1494431726,Error 404,Error 404,Error 404


In [49]:
data = data[data['Sales Rank'] != 'Error 404']
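The filter above can be sanity-checked by counting the sentinel rows before removing them. The following is a minimal sketch on a toy frame built from a few of the rows shown above (it is not the real scraped.csv); the names `toy`, `unreachable`, and `reachable` are hypothetical.

```python
import io
import pandas as pd

# Toy stand-in for scraped.csv, reusing a few rows printed above.
csv_text = '''ID,Sales Rank,Date,Review Score
545790352,118,"October 6, 2015",4.77
1507745923,Error 404,Error 404,Error 404
1419717014,399,"November 3, 2015",4.8
1423160657,Error 404,Error 404,Error 404
'''
toy = pd.read_csv(io.StringIO(csv_text))

# Boolean mask: True for rows whose page could not be reached.
unreachable = toy['Sales Rank'] == 'Error 404'
n_unreachable = int(unreachable.sum())

# Keep the complement of the mask; .copy() returns an independent frame.
reachable = toy[~unreachable].copy()

# Every removed row should be accounted for.
assert len(reachable) == len(toy) - n_unreachable
print(n_unreachable, len(reachable))
```

Because the sentinel is a string, the whole column is read as text, so later comparisons on 'Sales Rank' have to be made against strings until the column is converted.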

Let's see which rows have missing data in at least one of the scraped columns:

In [69]:
has_nans = data[data['Sales Rank'].isna() | data['Date'].isna() | data['Review Score'].isna()]
has_nans.head()

Unnamed: 0,ID,Sales Rank,Date,Review Score
43,0545703301,,,4.23
163,0545561639,,,4.29
175,1570548307,,,4.77
198,054549284X,,,4.68
203,0545561663,,,4.34


In [70]:
has_nans.shape

(873, 4)

The rows where all three columns are missing are most likely the pages on which the scraper was blocked.

In [71]:
blocked = data[data['Sales Rank'].isna() & data['Date'].isna() & data['Review Score'].isna()]
blocked.head()

Unnamed: 0,ID,Sales Rank,Date,Review Score
835,62233009,,,
839,753456095,,,
843,439903742,,,
844,399256059,,,
847,1770496459,,,


In [53]:
blocked.shape

(855, 4)
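The two masks used so far ("any column missing" and "all columns missing") can also be written with `isna().any(axis=1)` and `isna().all(axis=1)` over a list of columns. This is only a sketch on a toy frame with made-up IDs; `scraped_cols` and the other variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up IDs covering the situations seen above:
# complete, partially missing, and nothing scraped at all.
toy = pd.DataFrame({
    'ID': ['a', 'b', 'c', 'd'],
    'Sales Rank': ['118', np.nan, np.nan, '24'],
    'Date': ['October 6, 2015', np.nan, np.nan, np.nan],
    'Review Score': ['4.77', '4.23', np.nan, '4.7'],
})
scraped_cols = ['Sales Rank', 'Date', 'Review Score']
missing = toy[scraped_cols].isna()

any_missing = toy[missing.any(axis=1)]  # same idea as chaining isna() with |
all_missing = toy[missing.all(axis=1)]  # same idea as chaining isna() with &

# Number of missing fields per row: 0 = complete, 3 = nothing was scraped.
missing_per_row = missing.sum(axis=1)
print(missing_per_row.value_counts().sort_index())
```

The per-row count makes it easy to see how many rows fall between "complete" and "fully blocked".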

Most of the missing data (855 of the 873 rows) seems to come from the blocks. Let's set those rows aside for now (we can fill them in later) and look at the rows that are only partially missing:

In [54]:
has_nans.drop(blocked.index)

Unnamed: 0,ID,Sales Rank,Date,Review Score
43,0545703301,,,4.23
163,0545561639,,,4.29
175,1570548307,,,4.77
198,054549284X,,,4.68
203,0545561663,,,4.34
221,0545459907,,,4.61
233,159174802X,,,4.67
300,0060245867,24.0,,4.7
434,B005SN42LM,4793006.0,,4.5
2046,0807588997,,,4.59
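Note that `drop` returns a new DataFrame and leaves `has_nans` untouched, so the cell above only displays the partially missing rows. A small sketch of the same pattern on a toy frame (the values echo rows shown above; the `_toy` names are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy frame indexed like the real data: two partial rows and one blocked row.
toy = pd.DataFrame(
    {
        'Sales Rank': [np.nan, np.nan, '24'],
        'Date': [np.nan, np.nan, np.nan],
        'Review Score': ['4.23', np.nan, '4.7'],
    },
    index=[43, 835, 300],
)
has_nans_toy = toy[toy.isna().any(axis=1)]
blocked_toy = toy[toy.isna().all(axis=1)]

# drop() removes rows by index label and returns a new frame.
partial = has_nans_toy.drop(blocked_toy.index)

# has_nans_toy still contains the blocked row.
assert len(has_nans_toy) == 3
assert len(partial) == len(has_nans_toy) - len(blocked_toy)
print(partial)
```

If both shapes reported above come from the same state of `data`, the same subtraction gives 873 - 855 = 18 partially missing rows in the real data.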


Look at some statistics from the clean data:

In [72]:
clean = data.dropna()
clean.describe()

Unnamed: 0,ID,Sales Rank,Date,Review Score
count,3786,3786,3786,3786.0
unique,3786,3758,1630,164.0
top,62304089,479,"August 25, 2015",5.0
freq,1,3,32,203.0
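The count/unique/top/freq rows are what `describe()` reports for object (string) columns, which suggests the scraped columns are still stored as text here, plausibly because of the 'Error 404' entries (an assumption, not something the notebook states). A minimal sketch of converting them so that `describe()` yields numeric statistics, using a toy frame built from rows shown earlier; the date format string is assumed from the printed values.

```python
import pandas as pd

# Toy frame of strings, reusing values from the first table above.
toy = pd.DataFrame({
    'Sales Rank': ['118', '399', '9637'],
    'Date': ['October 6, 2015', 'November 3, 2015', 'October 6, 2015'],
    'Review Score': ['4.77', '4.8', '4.6'],
})

typed = toy.copy()
# errors='coerce' turns anything unparseable into NaN/NaT instead of raising.
typed['Sales Rank'] = pd.to_numeric(typed['Sales Rank'], errors='coerce')
typed['Review Score'] = pd.to_numeric(typed['Review Score'], errors='coerce')
typed['Date'] = pd.to_datetime(typed['Date'], format='%B %d, %Y', errors='coerce')

# With numeric dtypes, describe() reports mean, std, min, quartiles, and max.
summary = typed[['Sales Rank', 'Review Score']].describe()
print(summary)
```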


The duplicated sales ranks are interesting and worth investigating. For example, let's look at the most frequent one:

In [74]:
clean[clean['Sales Rank'] == '479']

Unnamed: 0,ID,Sales Rank,Date,Review Score
28,763644765,479,"September 14, 2010",4.6
126,1554537045,479,"April 1, 2014",4.61
3950,786807601,479,"March 5, 2001",4.83
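Rather than checking one rank at a time, all repeated ranks can be pulled out at once (the summary above reported 3758 unique ranks among 3786 rows). The following is a sketch on a toy frame that reuses the three rows above plus one non-duplicate; `dupes` and `repeated` are hypothetical names.

```python
import pandas as pd

# Toy frame: three books sharing rank '479' and one with a unique rank.
toy = pd.DataFrame(
    {
        'ID': ['763644765', '1554537045', '786807601', '545790352'],
        'Sales Rank': ['479', '479', '479', '118'],
    },
    index=[28, 126, 3950, 0],
)

# keep=False marks every member of a duplicate group, not just the repeats.
dupes = toy[toy['Sales Rank'].duplicated(keep=False)].sort_values('Sales Rank')

# How many books share each repeated rank.
counts = toy['Sales Rank'].value_counts()
repeated = counts[counts > 1]
print(repeated)
```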
