# Citation Counts and Impact Factors

The dataset comes from [https://datadryad.org/stash/dataset/doi:10.5061/dryad.2h4j5](https://datadryad.org/stash/dataset/doi:10.5061/dryad.2h4j5). The accompanying article is [https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1001675](https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1001675).

## Step 1

- Download the text file from [https://datadryad.org/stash/downloads/file_stream/30779](https://datadryad.org/stash/downloads/file_stream/30779)

- Copy it into the working directory
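The two steps above can also be scripted. The following is a minimal sketch, assuming the download link serves the file directly without a login or browser check; the helper name `fetch_if_missing` is hypothetical, and the file name is the one the notebook reads in Step 2.

```python
from pathlib import Path
from urllib.request import urlretrieve

# URL from Step 1 and the file name used by read_csv in Step 2.
DATA_URL = "https://datadryad.org/stash/downloads/file_stream/30779"
DATA_FILE = Path("F1000 data - Dryad.txt")


def fetch_if_missing(url: str, destination: Path) -> Path:
    """Download url to destination unless the file is already there (hypothetical helper)."""
    if not destination.exists():
        urlretrieve(url, str(destination))
    return destination


# Uncomment to download into the current working directory:
# fetch_if_missing(DATA_URL, DATA_FILE)
```

Skipping the download when the file already exists keeps repeated notebook runs from hitting the server again.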

## Step 2

- Open with pandas and normalize

In [144]:
import pandas


In [182]:
#read in our tab-separated text file; blank fields are treated as missing values (NaN)
citation_data = pandas.read_csv('F1000 data - Dryad.txt', sep="\t", skipinitialspace = True)
#no need to modify columns!
citation_data

,Journal,Score1,Score2,IF2,IF5,NoOfScores,Citations
0,Immunity,8,,24.221,22.133,1,220
1,Journal of the American Chemical Society,6,6.0,9.023,8.981,2,40
2,Science (New York),8,10.0,31.377,31.777,2,1003
3,Gastroenterology,6,8.0,12.032,12.403,2,85
4,Nature Medicine,10,,25.430,27.887,1,213
...,...,...,...,...,...,...,...
5806,Cerebral Cortex,6,,6.844,7.200,1,12
5807,The Journal of Biological Chemistry,6,,5.328,5.498,1,108
5808,Molecular Biology and Evolution,6,,5.510,8.907,1,37
5809,The Journal of Biological Chemistry,8,,5.328,5.498,1,10
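To see why `Score2` shows up as a float with gaps, here is a small sketch that parses two rows in the same tab-separated layout from an in-memory string. The rows mirror values displayed above; the column layout of the sample is an assumption, and `skipinitialspace` is left out because the sample has no padding.

```python
import io

import pandas

# Two illustrative rows; Score2 is blank in the first one.
sample = (
    "Journal\tScore1\tScore2\tIF2\tIF5\tNoOfScores\tCitations\n"
    "Immunity\t8\t\t24.221\t22.133\t1\t220\n"
    "Gastroenterology\t6\t8\t12.032\t12.403\t2\t85\n"
)
sample_df = pandas.read_csv(io.StringIO(sample), sep="\t")

# Blank fields become NaN, which forces the whole column to a float dtype.
print(sample_df.isna().sum())
print(sample_df.dtypes)
```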


In [183]:
#we drop rows with any missing values
citation_data = citation_data.dropna(how='any')
citation_data

,Journal,Score1,Score2,IF2,IF5,NoOfScores,Citations
1,Journal of the American Chemical Society,6,6.0,9.023,8.981,2,40
2,Science (New York),8,10.0,31.377,31.777,2,1003
3,Gastroenterology,6,8.0,12.032,12.403,2,85
5,Neuron,8,8.0,14.027,14.927,2,336
6,The Journal of Cell Biology,6,6.0,9.921,10.123,2,177
...,...,...,...,...,...,...,...
5753,Nature Immunology,10,8.0,25.668,25.934,2,125
5774,Molecular and Cellular Biology,8,8.0,6.188,6.381,2,73
5781,RNA (New York),6,6.0,6.051,5.486,2,20
5782,Molecular Biology and Evolution,6,8.0,5.510,8.907,2,11
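Before dropping rows it can help to check how many values are missing per column. A minimal sketch on four rows copied from the first table above (the frame name `scores` is made up for illustration):

```python
import numpy as np
import pandas

scores = pandas.DataFrame(
    {
        "Journal": ["Immunity", "Gastroenterology", "Nature Medicine", "Neuron"],
        "Score1": [8, 6, 10, 8],
        "Score2": [np.nan, 8.0, np.nan, 8.0],
    }
)

# Count the missing values in each column.
print(scores.isna().sum())

# how="any" removes a row as soon as one of its values is missing.
complete = scores.dropna(how="any")
print(len(scores), "rows before,", len(complete), "rows after")
print(complete.index.tolist())
```

`dropna` keeps the original row labels, which is why the index in the output above has gaps.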


In [184]:
#split into 10 quantile bins (deciles), labelled 0 to 9
#work on an explicit copy so the new column is not set on a slice of the earlier frame
citation_data = citation_data.copy()
citation_data["TopCitation"] = pandas.qcut(citation_data["Citations"],10,labels=False)
citation_data


,Journal,Score1,Score2,IF2,IF5,NoOfScores,Citations,TopCitation
1,Journal of the American Chemical Society,6,6.0,9.023,8.981,2,40,1
2,Science (New York),8,10.0,31.377,31.777,2,1003,9
3,Gastroenterology,6,8.0,12.032,12.403,2,85,3
5,Neuron,8,8.0,14.027,14.927,2,336,8
6,The Journal of Cell Biology,6,6.0,9.921,10.123,2,177,6
...,...,...,...,...,...,...,...,...
5753,Nature Immunology,10,8.0,25.668,25.934,2,125,5
5774,Molecular and Cellular Biology,8,8.0,6.188,6.381,2,73,2
5781,RNA (New York),6,6.0,6.051,5.486,2,20,0
5782,Molecular Biology and Evolution,6,8.0,5.510,8.907,2,11,0
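As an illustration of what `pandas.qcut` does with `labels=False`, here is a sketch on made-up data (the integers 1 to 100), where every bin ends up the same size:

```python
import pandas

# Made-up data: 100 distinct values, so each decile holds exactly 10 of them.
toy = pandas.Series(range(1, 101))

# labels=False returns the integer bin number instead of an interval.
bins = pandas.qcut(toy, 10, labels=False)

print(bins.value_counts().sort_index())
print(bins.min(), bins.max())
```

With real citation counts, ties can make the bins slightly unequal, since identical values always land in the same bin.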


In [186]:

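#collapse the decile labels into a binary flag: 1 for the top decile (label 9), 0 for everything else
citation_data["TopCitation"] = (citation_data["TopCitation"] == 9).astype(int)
#the raw counts are no longer needed once the flag exists
citation_data = citation_data.drop(columns="Citations")
citation_data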

,Journal,Score1,Score2,IF2,IF5,NoOfScores,TopCitation
1,Journal of the American Chemical Society,6,6.0,9.023,8.981,2,0
2,Science (New York),8,10.0,31.377,31.777,2,1
3,Gastroenterology,6,8.0,12.032,12.403,2,0
5,Neuron,8,8.0,14.027,14.927,2,0
6,The Journal of Cell Biology,6,6.0,9.921,10.123,2,0
...,...,...,...,...,...,...,...
5753,Nature Immunology,10,8.0,25.668,25.934,2,0
5774,Molecular and Cellular Biology,8,8.0,6.188,6.381,2,0
5781,RNA (New York),6,6.0,6.051,5.486,2,0
5782,Molecular Biology and Evolution,6,8.0,5.510,8.907,2,0
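The binning and the collapse to a 0/1 label can be wrapped in one function, which also makes it easy to read off the smallest value that still counts as "top". This is a sketch on made-up data; `top_decile_flag` is a hypothetical helper, not part of the notebook.

```python
import pandas


def top_decile_flag(values: pandas.Series) -> pandas.Series:
    """Return 1 where a value falls in the highest of 10 quantile bins, else 0."""
    deciles = pandas.qcut(values, 10, labels=False)
    return (deciles == 9).astype(int)


# Made-up data: the integers 1 to 100.
toy = pandas.Series(range(1, 101))
flag = top_decile_flag(toy)

# Smallest value that is still flagged as top decile.
cutoff = toy[flag == 1].min()
print(flag.sum(), cutoff)
```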


In [189]:
citation_data.groupby("TopCitation").count()

TopCitation,Journal,Score1,Score2,IF2,IF5,NoOfScores
0,1197,1197,1197,1197,1197,1197
1,131,131,131,131,131,131
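The counts above show an imbalanced label: 131 of 1,328 rows are positive. A short sketch that rebuilds a label column with those counts and reports the shares with `value_counts`:

```python
import pandas

# Rebuild a label column with the same class counts as the output above.
labels = pandas.Series([0] * 1197 + [1] * 131, name="TopCitation")

counts = labels.value_counts()
shares = labels.value_counts(normalize=True)

print(counts)
print(shares.round(4))
```

The positive share is 131 / 1328, a little under 10%. A likely reason it is not exactly 10% is tied citation counts at the bin edge, though the notebook does not check this.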


In [None]:
#Scale Citations so that the values don't vary that widely

#from sklearn import preprocessing

#scaler = preprocessing.MinMaxScaler()
#x = citation_data[["Citations"]].values.astype(int)
#citation_data["Citations"] = scaler.fit_transform(x)

#citation_data
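The commented-out cell relies on scikit-learn's `MinMaxScaler`. The same transform, (x - min) / (max - min), can be sketched in plain pandas on five citation counts taken from the tables above. Note that by this point the notebook has already removed the `Citations` column, so the cell would have to run earlier to work.

```python
import pandas

# Citation counts copied from the first rows of the cleaned table.
citations = pandas.Series([40, 1003, 85, 336, 177], name="Citations")

# Min-max scaling maps the smallest value to 0 and the largest to 1.
scaled = (citations - citations.min()) / (citations.max() - citations.min())

print(scaled)
```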


In [190]:
# Let's shorten a couple of our labels
# Proceedings of the National Academy of Sciences of the United States of America to Proceedings
# Science (New York) to Science
citation_data["Journal"] = citation_data["Journal"].replace({
    "Proceedings of the National Academy of Sciences of the United States of America": "Proceedings",
    "Science (New York)": "Science",
})
citation_data.to_csv('week_4_citation_homework.csv',index=False)
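Writing with `index=False` means the gapped row labels left over from `dropna` are not saved. A sketch of the round trip using an in-memory buffer instead of a file (the three rows are copied from the tables above):

```python
import io

import pandas

# Row labels with a gap, as left behind by dropna.
frame = pandas.DataFrame(
    {"Journal": ["Science", "Gastroenterology", "Neuron"], "TopCitation": [1, 0, 0]},
    index=[2, 3, 5],
)

# index=False leaves the row labels out of the CSV.
buffer = io.StringIO()
frame.to_csv(buffer, index=False)

# Reading it back yields a fresh 0..n-1 index.
buffer.seek(0)
round_trip = pandas.read_csv(buffer)
print(round_trip)
```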


Originally I planned to restrict the dataset to the 10 most frequent journals, but I don't think that is needed anymore. The commented-out cells below are kept for reference.

In [None]:
#It'll be a bit too much to create dummies for all of these journals!
#citation_data['Journal'].nunique()
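The concern about dummies is that one-hot encoding adds one column per distinct journal. A small sketch on a made-up `Journal` column shows the relationship between `nunique` and the width of `pandas.get_dummies`:

```python
import pandas

# Made-up sample with three distinct journals.
journals = pandas.Series(["Neuron", "Science", "Neuron", "Immunity"], name="Journal")

distinct = journals.nunique()

# One indicator column is created per distinct journal.
dummies = pandas.get_dummies(journals, prefix="Journal")

print(distinct)
print(dummies.columns.tolist())
```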

In [None]:
#How about just top n journals?
#n=10
#top = citation_data["Journal"].value_counts()[:n].index.tolist()
#top

In [None]:
#citation_data = citation_data[citation_data["Journal"].isin(top)]
#citation_data
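For reference, the top-n filter from the commented-out cells can be sketched on made-up data as follows. `head(n)` is used in place of slicing, and `.copy()` is added so later column assignments act on an independent frame.

```python
import pandas

# Made-up sample: Neuron appears 3 times, Science twice, Immunity once.
toy = pandas.DataFrame(
    {"Journal": ["Neuron", "Science", "Neuron", "Immunity", "Science", "Neuron"]}
)

n = 2
# value_counts sorts by frequency, so the first n labels are the most common journals.
top = toy["Journal"].value_counts().head(n).index.tolist()

# Keep only rows whose journal is in the top-n list.
filtered = toy[toy["Journal"].isin(top)].copy()

print(top)
print(len(filtered))
```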

In [None]:
# Let's shorten a couple of our labels
# Proceedings of the National Academy of Sciences of the United States of America to Proceedings
# Science (New York) to Science

#citation_data["Journal"].mask(citation_data["Journal"]== "Proceedings of the National Academy of Sciences of the United States of America",'Proceedings',inplace=True)
#citation_data["Journal"].mask(citation_data["Journal"]== "Science (New York)",'Science',inplace=True)

#citation_data

In [129]:
#citation_data.to_csv('week_4_citation_homework.csv',index=False)