# Sci-Hub Usage Data Prep for Week 3 Worksheet

These notes prepare the [original data set](https://datadryad.org/stash/dataset/doi:10.5061/dryad.q447c) for analysis in the Week 3 activities. The data come from a paper published in [Science](https://doi.org/10.1126/science.352.6285.508).

## Step 1

Download the complete dataset and unpack it:
- Download the Zip file from [https://datadryad.org/stash/dataset/doi:10.5061/dryad.q447c](https://datadryad.org/stash/dataset/doi:10.5061/dryad.q447c)
- Unzip it
- Navigate to the scihub_data directory
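
If you would rather unpack the archive from Python than by hand, the sketch below shows one way to do it with the standard library. The function name `unpack_dataset` and the paths in the usage note are hypothetical; the only assumption taken from the steps above is that the archive contains `.tab` files somewhere inside it.

```python
import zipfile
from pathlib import Path


def unpack_dataset(zip_path, dest_dir):
    """Extract zip_path into dest_dir and return the .tab files found inside."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
    # Search recursively, since the data files sit in a subdirectory.
    return sorted(dest.rglob("*.tab"))
```

For example, `unpack_dataset("downloaded_archive.zip", "data")` (both names are placeholders) returns the list of usage log files ready for the next step.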


## Step 2

Search for usage lines that mention Toronto or Montreal in the location, and dump them into a new file:

`grep --no-filename -E 'Toronto|Montreal' *.tab > week_3_scihub.tab`
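
For anyone without `grep` (for example on Windows), here is a rough Python equivalent of the command above. It is a sketch, not what was actually run: the function name `filter_city_lines` is made up for illustration, and like the `grep` call it matches the city names anywhere on the line rather than only in the city column.

```python
import re
from pathlib import Path

# Same alternation as the grep pattern above.
CITY_PATTERN = re.compile(r"Toronto|Montreal")


def filter_city_lines(data_dir, output_path, pattern=CITY_PATTERN):
    """Copy every matching line from data_dir/*.tab into output_path.

    Returns the number of lines written.
    """
    output_path = Path(output_path)
    matched = 0
    with output_path.open("w", encoding="utf-8") as out:
        for tab_file in sorted(Path(data_dir).glob("*.tab")):
            # Skip the output file if it lives in the same directory.
            if tab_file.resolve() == output_path.resolve():
                continue
            with tab_file.open(encoding="utf-8", errors="replace") as src:
                for line in src:
                    if pattern.search(line):
                        # Make sure every copied line ends with a newline.
                        out.write(line if line.endswith("\n") else line + "\n")
                        matched += 1
    return matched
```

Because the output path is a separate argument, the result can be written straight to wherever it is needed, e.g. `filter_city_lines("scihub_data", "week_3_scihub.tab")`.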

Move that file back into the main directory

In [40]:
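import pandas

# the grep output has no header row, so read it as tab-separated and name the columns by hand
sci_data = pandas.read_csv("week_3_scihub.tab", sep="\t", header=None)
sci_data.columns = ["timestamp", "doi_whole", "user_id", "country", "city", "coords"]
sci_data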

| | timestamp | doi_whole | user_id | country | city | coords |
|---|---|---|---|---|---|---|
| 0 | 2015-12-01 00:00:45 | 10.1016/j.memsci.2007.03.046 | 56ed2b0e0d52f | Canada | Toronto | 43.653226,-79.3831843 |
| 1 | 2015-12-01 00:01:32 | 10.1109/PES.2007.385969 | 56ed2c29a40f2 | Canada | Montreal | 45.5016889,-73.567256 |
| 2 | 2015-12-01 00:02:23 | 10.1109/PSC.2014.6808094 | 56ed2c29a40f2 | Canada | Montreal | 45.5016889,-73.567256 |
| 3 | 2015-12-01 00:03:00 | 10.1109/PSC.2014.6808094 | 56ed2c29a40f2 | Canada | Montreal | 45.5016889,-73.567256 |
| 4 | 2015-12-01 00:03:01 | 10.1007/978-3-642-30574-0_46 | 56ed2c29a40f2 | Canada | Montreal | 45.5016889,-73.567256 |
| ... | ... | ... | ... | ... | ... | ... |
| 152977 | 2015-09-30 22:41:04 | 10.1111/petr.12212 | 56ed2b3c9d668 | Canada | Toronto | 43.653226,-79.3831843 |
| 152978 | 2015-09-30 22:45:52 | 10.1142/S0217984914300014 | 56ed2b3c0b270 | Canada | Montreal | 45.5016889,-73.567256 |
| 152979 | 2015-09-30 22:52:44 | 10.1016/S0260-8774(03)00241-3 | 56ed2b4e34dd1 | Canada | Toronto | 43.653226,-79.3831843 |
| 152980 | 2015-09-30 22:52:46 | 10.1080/10942910701233389 | 56ed2b4e34dd1 | Canada | Toronto | 43.653226,-79.3831843 |


In [41]:
# get rid of columns we don't need
sci_data = sci_data.drop(columns=["coords", "country"])
sci_data

| | timestamp | doi_whole | user_id | city |
|---|---|---|---|---|
| 0 | 2015-12-01 00:00:45 | 10.1016/j.memsci.2007.03.046 | 56ed2b0e0d52f | Toronto |
| 1 | 2015-12-01 00:01:32 | 10.1109/PES.2007.385969 | 56ed2c29a40f2 | Montreal |
| 2 | 2015-12-01 00:02:23 | 10.1109/PSC.2014.6808094 | 56ed2c29a40f2 | Montreal |
| 3 | 2015-12-01 00:03:00 | 10.1109/PSC.2014.6808094 | 56ed2c29a40f2 | Montreal |
| 4 | 2015-12-01 00:03:01 | 10.1007/978-3-642-30574-0_46 | 56ed2c29a40f2 | Montreal |
| ... | ... | ... | ... | ... |
| 152977 | 2015-09-30 22:41:04 | 10.1111/petr.12212 | 56ed2b3c9d668 | Toronto |
| 152978 | 2015-09-30 22:45:52 | 10.1142/S0217984914300014 | 56ed2b3c0b270 | Montreal |
| 152979 | 2015-09-30 22:52:44 | 10.1016/S0260-8774(03)00241-3 | 56ed2b4e34dd1 | Toronto |
| 152980 | 2015-09-30 22:52:46 | 10.1080/10942910701233389 | 56ed2b4e34dd1 | Toronto |


In [42]:
# split the DOI at the first slash into prefix and suffix
# (n=1 keeps any later slashes inside the suffix instead of dropping them)
doi_parts = sci_data.doi_whole.str.split('/', n=1, expand=True)
sci_data["doi_prefix"] = doi_parts[0]
sci_data["doi_suffix"] = doi_parts[1]
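
A DOI suffix is allowed to contain slashes of its own, so it is worth seeing what the split does in that case. The small sketch below uses one real DOI from the table above and one made-up DOI (`10.1000/made-up/suffix`, purely for illustration) to show that splitting at the first slash only, with `n=1`, keeps the whole suffix together.

```python
import pandas as pd

dois = pd.Series([
    "10.1016/j.memsci.2007.03.046",  # taken from the data above
    "10.1000/made-up/suffix",        # hypothetical DOI with a slash in the suffix
])

# n=1 limits the split to the first slash, so we always get exactly two columns.
parts = dois.str.split("/", n=1, expand=True)
doi_prefix = parts[0]
doi_suffix = parts[1]

print(pd.DataFrame({"doi_prefix": doi_prefix, "doi_suffix": doi_suffix}))
```

Without the `n=1` limit, the second DOI would be spread over three columns and taking column `1` alone would keep only `made-up`.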

In [43]:
#rearrange columns
sci_data = sci_data[["timestamp","user_id","city","doi_whole","doi_prefix","doi_suffix"]]
sci_data

| | timestamp | user_id | city | doi_whole | doi_prefix | doi_suffix |
|---|---|---|---|---|---|---|
| 0 | 2015-12-01 00:00:45 | 56ed2b0e0d52f | Toronto | 10.1016/j.memsci.2007.03.046 | 10.1016 | j.memsci.2007.03.046 |
| 1 | 2015-12-01 00:01:32 | 56ed2c29a40f2 | Montreal | 10.1109/PES.2007.385969 | 10.1109 | PES.2007.385969 |
| 2 | 2015-12-01 00:02:23 | 56ed2c29a40f2 | Montreal | 10.1109/PSC.2014.6808094 | 10.1109 | PSC.2014.6808094 |
| 3 | 2015-12-01 00:03:00 | 56ed2c29a40f2 | Montreal | 10.1109/PSC.2014.6808094 | 10.1109 | PSC.2014.6808094 |
| 4 | 2015-12-01 00:03:01 | 56ed2c29a40f2 | Montreal | 10.1007/978-3-642-30574-0_46 | 10.1007 | 978-3-642-30574-0_46 |
| ... | ... | ... | ... | ... | ... | ... |
| 152977 | 2015-09-30 22:41:04 | 56ed2b3c9d668 | Toronto | 10.1111/petr.12212 | 10.1111 | petr.12212 |
| 152978 | 2015-09-30 22:45:52 | 56ed2b3c0b270 | Montreal | 10.1142/S0217984914300014 | 10.1142 | S0217984914300014 |
| 152979 | 2015-09-30 22:52:44 | 56ed2b4e34dd1 | Toronto | 10.1016/S0260-8774(03)00241-3 | 10.1016 | S0260-8774(03)00241-3 |
| 152980 | 2015-09-30 22:52:46 | 56ed2b4e34dd1 | Toronto | 10.1080/10942910701233389 | 10.1080 | 10942910701233389 |


In [45]:
#write this out as a proper CSV file
sci_data.to_csv('week_3_sci_hub_worksheet.csv',index=False)
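
One thing to watch when the worksheet CSV is read back in: a column like `doi_prefix` looks numeric, so pandas will by default parse `10.1080` as a floating-point number and the trailing zero is lost. The sketch below demonstrates this on a single row copied from the table above, using an in-memory buffer instead of the real file; passing `dtype=str` is one way to keep the values as text.

```python
import io

import pandas as pd

frame = pd.DataFrame({
    "timestamp": ["2015-09-30 22:52:46"],
    "user_id": ["56ed2b4e34dd1"],
    "city": ["Toronto"],
    "doi_whole": ["10.1080/10942910701233389"],
    "doi_prefix": ["10.1080"],
    "doi_suffix": ["10942910701233389"],
})

# Write to an in-memory buffer the same way the worksheet file is written.
csv_text = frame.to_csv(index=False)

# Default parsing: doi_prefix is inferred as a float column.
naive = pd.read_csv(io.StringIO(csv_text))

# Forcing every column to text keeps the DOI pieces exactly as written.
restored = pd.read_csv(io.StringIO(csv_text), dtype=str)

print(naive.dtypes)
print(restored.loc[0, "doi_prefix"])
```

The same `dtype=str` argument can be used with `pandas.read_csv('week_3_sci_hub_worksheet.csv', dtype=str)` when loading the real file.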