This notebook is used to reshape the **paper** and **fos** datasets.

Operations on the **fos** dataset:
- Group all the fos by *name* and compute the average of their *weights*
- Generate a unique *id* for each fos
- Drop the *paper_id* field

Operations on the **paper** dataset:
- For each paper, add the *ids* of the fields of study covered by the paper

In [1]:
import pandas as pd

In [2]:
# Config: set these to the locations of the input CSV files before running
fos_path = ""
paper_path = ""

In [3]:
df = pd.read_csv(fos_path, encoding="UTF-8")
df

Unnamed: 0,paper_id,fos_name,fos_weight
0,101421652,Decision-making,0.495733
1,101421652,Knowledge management,0.460340
2,101421652,Workload,0.539081
3,101421652,Interface design,0.000000
4,101421652,Information system,0.562489
...,...,...,...
41586,9715319,E-commerce,0.514008
41587,9715319,Public relations,0.439679
41588,9715319,Vendor,0.521475
41589,9715319,Database transaction,0.404049


In [4]:
paper_df = pd.read_csv(paper_path, encoding="UTF-8")
paper_df

Unnamed: 0,paper_id,title,year,n_citation,page_start,page_end,doc_type,publisher,volume,issue,doi,references,abstract,venue_id,authors
0,101421652,The influence of query interface design on dec...,2003,139,397,423,Journal,Society for Information Management and The Man...,27,3,10.2307/30036539,"[1516261653, 1978738035]",Managers in modern organizations are confronte...,57293258,"[1973614237, 2113568395]"
1,1015675232,Research-paper recommender systems: a literatu...,2016,106,305,338,Journal,Springer Berlin Heidelberg,17,4,10.1007/s00799-015-0156-0,"[1971040550, 2012451152, 2019443264, 205011383...","In the last 16 years, more than 200 research a...",110615584,"[2032888927, 72611330, 2135709281, 2063223331]"
2,10311529,Technical Section: Sketch-based modeling: A su...,2009,236,85,103,Journal,"Pergamon Press, Inc.",33,1,10.1016/j.cag.2008.09.013,"[2075597533, 2118304946]",User interfaces in modeling have traditionally...,94821547,"[2142664686, 1821145341, 2105740368, 2120678171]"
3,104754383,Shackled to the status quo: the inhibiting eff...,2012,215,21,42,Journal,Society for Information Management and The Man...,36,1,10.2307/41410404,"[1758167268, 1918570226, 2098685541, 211598118...",Given that adoption of a new system often impl...,57293258,"[2305721212, 270263451]"
4,108157922,The State of the Art in Text Filtering,1997,114,141,178,Journal,Kluwer Academic Publishers,7,3,10.1023/A:1008287121180,[],This paper develops a conceptual framework for...,160628929,[7916806]
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8329,86453134,IDDQ testing: A review,1992,169,291,303,Journal,Springer US,3,4,10.1007/BF00135333,[],Quiescent power supply current (I DDQ ) testin...,200807567,"[2112919840, 2145526017, 2599269601, 2306694419]"
8330,90851268,Issues in computational Vickrey auctions,2000,108,107,129,Conference,"M. E. Sharpe, Inc.",4,3,10.1080/10864415.2000.11518374,[],The Vickrey auction has been widely advocated ...,1195874795,[2070867630]
8331,914561379,A Robust and Efficient Video Representation fo...,2016,116,219,238,Journal,Springer US,119,3,10.1007/s11263-015-0846-5,"[1966385142, 1984309565, 2020163092, 206861165...",This paper introduces a state-of-the-art video...,25538012,"[2777199248, 273224254, 2145238836, 2111851554]"
8332,925742927,Sensitivity analysis: A review of recent advances,2016,130,869,887,Journal,Elsevier,248,3,10.1016/j.ejor.2015.06.032,"[1979362125, 2148080122]",The solution of several operations research pr...,103321696,"[2082753280, 2559736284]"


Group the fos by *name* and compute the average of the *weight*:

In [5]:
fos_df = df.groupby("fos_name")["fos_weight"].mean().reset_index()
fos_df

Unnamed: 0,fos_name,fos_weight
0,1-planar graph,0.705774
1,2-opt,0.681086
2,32-bit,0.435294
3,3D computer graphics,0.613358
4,3D modeling,0.503868
...,...,...
6083,sysfs,0.643146
6084,t-closeness,0.620392
6085,tf–idf,0.438651
6086,x86,0.477826
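
As a minimal sketch of what this aggregation does, the same call can be run on a tiny made-up table. The frame `toy` and its values are assumptions for illustration only, not data from the real **fos** file:

```python
import pandas as pd

# Toy stand-in for the fos dataset (values are invented for illustration).
toy = pd.DataFrame({
    "paper_id": [1, 1, 2, 3],
    "fos_name": ["E-commerce", "Vendor", "E-commerce", "Vendor"],
    "fos_weight": [0.50, 0.40, 0.25, 0.60],
})

# One row per distinct name; the weight is the mean over every paper
# that lists that fos. Selecting only "fos_weight" drops "paper_id".
toy_fos = toy.groupby("fos_name")["fos_weight"].mean().reset_index()
print(toy_fos)
```

Because `groupby` sorts its keys by default, the resulting rows come out ordered by name, which matches the alphabetical ordering visible in the output above.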


Generate a unique *id* for each fos and add it to the DataFrame:

In [6]:
fos_id = [100000 + i for i in range(len(fos_df))]
fos_df["id"] = fos_id
fos_df

Unnamed: 0,fos_name,fos_weight,id
0,1-planar graph,0.705774,100000
1,2-opt,0.681086,100001
2,32-bit,0.435294,100002
3,3D computer graphics,0.613358,100003
4,3D modeling,0.503868,100004
...,...,...,...
6083,sysfs,0.643146,106083
6084,t-closeness,0.620392,106084
6085,tf–idf,0.438651,106085
6086,x86,0.477826,106086
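
The id assignment can be sanity-checked on a small frame. This is only a sketch: `toy_fos` and the constant name `ID_OFFSET` are hypothetical, and the values are invented.

```python
import pandas as pd

# Toy stand-in for the grouped fos table (values are invented).
toy_fos = pd.DataFrame({
    "fos_name": ["2-opt", "sysfs", "x86"],
    "fos_weight": [0.68, 0.64, 0.48],
})

ID_OFFSET = 100000  # hypothetical name for the offset used in the notebook

# Same scheme as above: the offset plus the row position.
toy_fos["id"] = [ID_OFFSET + i for i in range(len(toy_fos))]

# Positions are distinct, so the ids should be distinct as well.
ids_are_unique = toy_fos["id"].is_unique
print(toy_fos, ids_are_unique)
```

Note that the ids depend on the row order, which here follows the sorted names. Re-running the notebook on a different set of fos names would likely assign different ids to the same name.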


Now we work on the **paper** dataset. We open the original **fos** dataset again for ease of use.

In [7]:
old_fos_df = pd.read_csv(fos_path, encoding="UTF-8")

Merge each *paper_id* with the new fos *id*, joining on *fos_name*:

In [8]:
old_df = pd.merge(old_fos_df[["paper_id", "fos_name"]], fos_df, on="fos_name")
old_df

Unnamed: 0,paper_id,fos_name,fos_weight,id
0,101421652,Decision-making,0.453829,101369
1,1982671202,Decision-making,0.453829,101369
2,2052729728,Decision-making,0.453829,101369
3,2112042732,Decision-making,0.453829,101369
4,2120230944,Decision-making,0.453829,101369
...,...,...,...,...
41586,7126071,MEDCIN,0.538794,103141
41587,85271976,Social cognitive theory,0.523215,105040
41588,90851268,Generalized second-price auction,0.623478,102227
41589,90851268,Proxy bid,0.603282,104328
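
A small sketch of the same join, assuming made-up frames `pairs` and `lookup`. The `validate="many_to_one"` argument is an optional addition not used in the notebook; it asks pandas to raise an error if a name appears more than once in the lookup table.

```python
import pandas as pd

# (paper, fos name) pairs, as in the original fos file (invented values).
pairs = pd.DataFrame({
    "paper_id": [1, 1, 2],
    "fos_name": ["Vendor", "E-commerce", "Vendor"],
})

# One row per fos name, carrying the averaged weight and the new id.
lookup = pd.DataFrame({
    "fos_name": ["E-commerce", "Vendor"],
    "fos_weight": [0.5, 0.45],
    "id": [100000, 100001],
})

# Inner join on the name; every pair row picks up the id of its fos.
merged = pd.merge(pairs, lookup, on="fos_name", validate="many_to_one")
print(merged)
```

The *fos_weight* column in the merged result is the per-name average, not the original per-paper weight, which is why every "Decision-making" row above shows the same value.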


Collect all the fos *ids* that a paper covers into a single row:

In [9]:
# Collect the fos ids of each paper into one list per paper_id
old_df = old_df.groupby("paper_id")["id"].apply(list).reset_index()
old_df = old_df.rename(columns={"id": "fos_id"})
old_df

Unnamed: 0,paper_id,fos_id
0,7126071,"[102669, 101294, 103292, 101364, 103141]"
1,9715319,"[101643, 104353, 105871, 101334, 101236]"
2,10311529,"[100970, 100270, 105001, 102262, 102502]"
3,12532419,"[103244, 103247, 102306, 102843, 102815]"
4,30814344,"[102929, 100963, 103181, 101886, 105456]"
...,...,...
8329,2761239369,"[100963, 104749, 100453, 104810, 102688]"
8330,2766000922,"[100963, 102669, 101294, 101322, 104409]"
8331,2768461089,"[101525, 100844, 102882, 103350, 103352]"
8332,2776316078,"[100963, 105571, 102083, 101729, 104303]"
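
The list-building step can be illustrated on a few invented rows. The frame `merged` below is a hypothetical stand-in for the merged table:

```python
import pandas as pd

# Toy stand-in for the merged (paper_id, id) table (invented values).
merged = pd.DataFrame({
    "paper_id": [2, 1, 1],
    "id": [100001, 100001, 100000],
})

# apply(list) turns each group's ids into one Python list;
# reset_index() moves paper_id back into a regular column.
per_paper = (
    merged.groupby("paper_id")["id"]
    .apply(list)
    .reset_index()
    .rename(columns={"id": "fos_id"})
)
print(per_paper)
```

Within each list the ids keep the order in which they appeared in the input rows, while the papers themselves come out sorted by *paper_id*.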


Add these *fos_id* lists to the **paper** dataset:

In [10]:
paper_df = pd.merge(paper_df, old_df, how="left", on="paper_id")
paper_df

Unnamed: 0,paper_id,title,year,n_citation,page_start,page_end,doc_type,publisher,volume,issue,doi,references,abstract,venue_id,authors,fos_id
0,101421652,The influence of query interface design on dec...,2003,139,397,423,Journal,Society for Information Management and The Man...,27,3,10.2307/30036539,"[1516261653, 1978738035]",Managers in modern organizations are confronte...,57293258,"[1973614237, 2113568395]","[101369, 102929, 106036, 102766, 102680]"
1,1015675232,Research-paper recommender systems: a literatu...,2016,106,305,338,Journal,Springer Berlin Heidelberg,17,4,10.1007/s00799-015-0156-0,"[1971040550, 2012451152, 2019443264, 205011383...","In the last 16 years, more than 200 research a...",110615584,"[2032888927, 72611330, 2135709281, 2063223331]","[100963, 102661, 101414, 102590, 102669]"
2,10311529,Technical Section: Sketch-based modeling: A su...,2009,236,85,103,Journal,"Pergamon Press, Inc.",33,1,10.1016/j.cag.2008.09.013,"[2075597533, 2118304946]",User interfaces in modeling have traditionally...,94821547,"[2142664686, 1821145341, 2105740368, 2120678171]","[100970, 100270, 105001, 102262, 102502]"
3,104754383,Shackled to the status quo: the inhibiting eff...,2012,215,21,42,Journal,Society for Information Management and The Man...,36,1,10.2307/41410404,"[1758167268, 1918570226, 2098685541, 211598118...",Given that adoption of a new system often impl...,57293258,"[2305721212, 270263451]","[103181, 105367, 102627, 105265, 105266]"
4,108157922,The State of the Art in Text Filtering,1997,114,141,178,Journal,Kluwer Academic Publishers,7,3,10.1023/A:1008287121180,[],This paper develops a conceptual framework for...,160628929,[7916806],"[100963, 102669, 100270, 101294, 103154]"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8329,86453134,IDDQ testing: A review,1992,169,291,303,Journal,Springer US,3,4,10.1007/BF00135333,[],Quiescent power supply current (I DDQ ) testin...,200807567,"[2112919840, 2145526017, 2599269601, 2306694419]","[100963, 104584, 101707, 101730, 102721]"
8330,90851268,Issues in computational Vickrey auctions,2000,108,107,129,Conference,"M. E. Sharpe, Inc.",4,3,10.1080/10864415.2000.11518374,[],The Vickrey auction has been widely advocated ...,1195874795,[2070867630],"[100847, 100854, 102227, 104328, 101780]"
8331,914561379,A Robust and Efficient Video Representation fo...,2016,116,219,238,Journal,Springer US,119,3,10.1007/s11263-015-0846-5,"[1966385142, 1984309565, 2020163092, 206861165...",This paper introduces a state-of-the-art video...,25538012,"[2777199248, 273224254, 2145238836, 2111851554]","[100970, 100270, 103154, 104689, 103875]"
8332,925742927,Sensitivity analysis: A review of recent advances,2016,130,869,887,Journal,Elsevier,248,3,10.1016/j.ejor.2015.06.032,"[1979362125, 2148080122]",The solution of several operations research pr...,103321696,"[2082753280, 2559736284]","[101294, 103244, 105842, 105767, 103422]"
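
With `how="left"`, every paper is kept even when it has no matching fos row, in which case *fos_id* is filled with NaN. The sketch below uses invented frames to show this; replacing NaN with an empty list is an optional extra step that the notebook itself does not perform.

```python
import pandas as pd

# Toy papers table; paper 3 has no fos entry (invented values).
papers = pd.DataFrame({"paper_id": [1, 2, 3], "title": ["A", "B", "C"]})
per_paper = pd.DataFrame({
    "paper_id": [1, 2],
    "fos_id": [[100000, 100001], [100001]],
})

# Left join: all rows of `papers` survive, unmatched ones get NaN.
out = pd.merge(papers, per_paper, how="left", on="paper_id")
missing = out["fos_id"].isna()

# Optional: normalise the unmatched rows to empty lists.
out["fos_id"] = out["fos_id"].apply(lambda v: v if isinstance(v, list) else [])
print(out)
```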


Finally, we can store the two datasets in two separate files:

In [11]:
# Use shorter column names and put the id first in the exported file
fos_df = fos_df.rename(columns={"fos_weight": "weight", "fos_name": "name"})
fos_df = fos_df[["id", "name", "weight"]]
fos_df.to_csv("fos_spark.csv", encoding="UTF-8", index=False, escapechar="|")

In [12]:
paper_df.to_csv("paper_spark.csv", encoding="UTF-8", index=False, escapechar="|")
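
One thing to keep in mind when reading these files back: CSV has no list type, so the *fos_id* lists are written as text such as `"[100000, 100001]"`. A possible way to restore them with pandas is sketched below on invented data, using an in-memory string instead of a file. A downstream consumer such as Spark would need its own parsing step, which is not covered here.

```python
import ast
import io

import pandas as pd

# Toy frame with a list-valued column (invented values).
out = pd.DataFrame({"paper_id": [1, 2], "fos_id": [[100000, 100001], [100001]]})

# Writing to CSV stores each list as its string representation.
csv_text = out.to_csv(index=False)

# Read back naively: the column holds plain strings.
raw = pd.read_csv(io.StringIO(csv_text))

# Read back with a converter: literal_eval rebuilds the Python lists.
back = pd.read_csv(io.StringIO(csv_text), converters={"fos_id": ast.literal_eval})
print(type(raw["fos_id"].iloc[0]), back["fos_id"].iloc[0])
```

This converter assumes every cell holds a list literal. An empty cell, as produced for a paper with no fos, would make `ast.literal_eval` raise an error and would need separate handling.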