# Filter Collabo Data

The Collabo data collected from Scopus tag each paper with General and Specific Areas. This notebook selects the papers whose areas fall under `Computing`, so that they can be linked with the Philippine Computing Society data.

In [4]:
import pandas as pd
import numpy as np
from pathlib import Path

In [5]:
data = pd.read_csv(Path('raw/collabo-dataset-2014_2018.csv'))
data.head()

Unnamed: 0,Author 1,Author 2,ID 1,ID 2,Affiliation 1,Affiliation 2,Year,DOI,EID,General Areas,Specific Areas,AUID 1,AUID 2
0,Jar Carlo C. Ramirez,Terence P. Tumolva,57201376247,48561665200,University of the Philippines - Diliman,University of the Philippines - Diliman,2018,,2-s2.0-85044499335,"Dentistry (miscellaneous); Surfaces, Coatings ...",DENT; MATE,AUID-57201376247,AUID-48561665200
1,Maica Krizna Areja Gavina,Jerrold M. Tubay,56059419700,55900656800,University of the Philippines - Los Banos,University of the Philippines - Los Banos,2018,,2-s2.0-85046644134,Multidisciplinary,MULT,AUID-56059419700,AUID-55900656800
2,Maica Krizna Areja Gavina,Jomar F. Rabajante,56059419700,56059058900,University of the Philippines - Los Banos,University of the Philippines - Los Banos,2018,,2-s2.0-85046644134,Multidisciplinary,MULT,AUID-56059419700,AUID-56059058900
3,Jerrold M. Tubay,Jomar F. Rabajante,55900656800,56059058900,University of the Philippines - Los Banos,University of the Philippines - Los Banos,2018,,2-s2.0-85046644134,Multidisciplinary,MULT,AUID-55900656800,AUID-56059058900
4,Fe M. Dela Cueva,J. S. Mendoza,35097132700,57202584900,University of the Philippines - Los Banos,University of the Philippines - Los Banos,2018,,2-s2.0-85048825826,Agronomy and Crop Science,AGRI,AUID-35097132700,AUID-57202584900


### List of all areas

To see which areas people have published in, we split the category tags into individual values so that we can select only the ones we need. The selection is done manually from the available fields.

In [6]:
# Unique tag strings, each split into a list of individual area names
areas = data['General Areas'].drop_duplicates().str.split('; ').tolist()

In [7]:
len(areas)

646

In [8]:
# Flatten the list of lists and trim stray whitespace around each name
flat_list = [item.strip() for sublist in areas for item in sublist]

In [9]:
final_areas = np.unique(flat_list)
len(final_areas)

238
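
The split, flatten, and de-duplicate steps above could also be done in a single pandas chain. The following is a minimal sketch on toy data, not the notebook's own method; the `general_areas` series is a hypothetical stand-in for `data['General Areas']`, and the `dropna()` call assumes some papers might have no area tag.

```python
import pandas as pd

# Hypothetical stand-in for data['General Areas']; values are illustrative only.
general_areas = pd.Series([
    'Software; Information Systems',
    'Multidisciplinary',
    'Software;  Artificial Intelligence ',
    None,
])

unique_areas = (
    general_areas
    .dropna()            # skip papers without any area tag (assumption)
    .str.split(';')      # one list of area names per paper
    .explode()           # one row per individual area name
    .str.strip()         # trim whitespace left over from the split
    .drop_duplicates()
    .sort_values()
    .tolist()
)
print(unique_areas)
```

Splitting on `';'` and stripping afterwards tolerates inconsistent spacing around the separator, which splitting on `'; '` alone would not.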

In [10]:
final_areas

array(['Accounting', 'Acoustics and Ultrasonics', 'Aerospace Engineering',
       'Agricultural and Biological Sciences (all)',
       'Agricultural and Biological Sciences (miscellaneous)',
       'Agronomy and Crop Science', 'Algebra and Number Theory',
       'Analysis', 'Analytical Chemistry', 'Animal Science and Zoology',
       'Anthropology', 'Applied Mathematics',
       'Applied Microbiology and Biotechnology', 'Applied Psychology',
       'Aquatic Science', 'Archeology',
       'Archeology (arts and humanities)', 'Artificial Intelligence',
       'Arts and Humanities (all)', 'Arts and Humanities (miscellaneous)',
       'Astronomy and Astrophysics', 'Atmospheric Science',
       'Atomic and Molecular Physics, and Optics',
       'Automotive Engineering', 'Biochemistry',
       'Biochemistry, Genetics and Molecular Biology (all)',
       'Biochemistry, Genetics and Molecular Biology (miscellaneous)',
       'Bioengineering', 'Biomaterials', 'Biomedical Engineering',
       'Bi

### Computing Areas

From the list above, we manually selected the areas to consider for comparison. This is a naive approach, but since the list is not that long, it should be fine for now.

In [11]:
cs_areas = ['Computer Graphics and Computer-Aided Design','Computer Networks and Communications',
            'Computer Science (all)','Computer Science (miscellaneous)','Computer Science Applications',
            'Computer Vision and Pattern Recognition',
            #'Computers in Earth Sciences',
            'Hardware and Architecture','Human-Computer Interaction','Information Systems', 
            'Information Systems and Management','Management Information Systems','Software',
            'Theoretical Computer Science']
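
Because the list is typed by hand, a misspelled area name would silently match nothing. One way to guard against that is to check the selection against the available areas. This is a small sketch with hypothetical stand-ins: `available_areas` plays the role of `final_areas` and `selected_areas` the role of `cs_areas`, with one deliberate typo.

```python
# Hypothetical stand-ins for final_areas and cs_areas; values are illustrative.
available_areas = {'Software', 'Information Systems', 'Theoretical Computer Science'}
selected_areas = ['Software', 'Theoretical Computer Science', 'Sofware Engineering']

# Names in the manual selection that do not exist in the data
missing = sorted(set(selected_areas) - available_areas)
print(missing)
```

An empty `missing` list would mean that every selected name appears in the data at least once.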

Splitting the areas into individual rows for each paper lets us compare each paper's areas against the list above.

In [12]:
# One column per area position, indexed by EID; stack() then yields one row per (EID, area)
paper_areas = pd.DataFrame(data['General Areas'].str.split('; ').tolist(),
                           index=data['EID']).stack().reset_index([0])
paper_areas.columns = ['EID', 'Area']
paper_areas.head()

Unnamed: 0,EID,Area
0,2-s2.0-85044499335,Dentistry (miscellaneous)
1,2-s2.0-85044499335,"Surfaces, Coatings and Films"
2,2-s2.0-85044499335,Polymers and Plastics
3,2-s2.0-85044499335,Materials Chemistry
0,2-s2.0-85046644134,Multidisciplinary
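
The same long format can be produced with `Series.str.split` followed by `DataFrame.explode`, which some readers may find easier to follow than the `stack()` approach. This is a sketch on a toy frame rather than the notebook's data; the `toy` frame and its EID values are hypothetical.

```python
import pandas as pd

# Hypothetical edge rows: the first paper appears twice, once per author pair.
toy = pd.DataFrame({
    'EID': ['eid-1', 'eid-1', 'eid-2'],
    'General Areas': [
        'Software; Information Systems',
        'Software; Information Systems',
        'Multidisciplinary',
    ],
})

long_areas = (
    toy[['EID', 'General Areas']]
    .assign(Area=lambda df: df['General Areas'].str.split('; '))
    .explode('Area')                  # one row per (paper, area)
    .drop(columns='General Areas')
    .drop_duplicates()                # collapse repeats from multiple author pairs
    .reset_index(drop=True)
)
print(long_areas)
```

Dropping duplicates here leaves one row per paper and area, regardless of how many author pairs the paper has.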


Selecting the papers with areas from the list and dropping any duplicates (the data come from the author edges file, so each paper appears once per author pair).

In [13]:
cs_papers = paper_areas[paper_areas['Area'].isin(cs_areas)].drop_duplicates().reset_index(drop=True)
cs_papers.head()

Unnamed: 0,EID,Area
0,2-s2.0-85045563943,Theoretical Computer Science
1,2-s2.0-85045563943,Computer Science (all)
2,2-s2.0-84995784254,Computer Science Applications
3,2-s2.0-85045991739,Computer Science Applications
4,2-s2.0-85046751690,Computer Networks and Communications


In [14]:
# Keep every author edge whose paper has at least one computing area
cs_authors = data[data['EID'].isin(cs_papers['EID'])]
cs_authors.shape

(4284, 13)

In [15]:
cs_authors.head()

Unnamed: 0,Author 1,Author 2,ID 1,ID 2,Affiliation 1,Affiliation 2,Year,DOI,EID,General Areas,Specific Areas,AUID 1,AUID 2
15,Kelvin C. Buño,Francis George C. Cabarle,57200600155,53983599800,University of the Philippines - Diliman,University of the Philippines - Diliman,2018,,2-s2.0-85045563943,Theoretical Computer Science; Computer Science...,MATH; COMP,AUID-57200600155,AUID-53983599800
16,Kelvin C. Buño,Marj Darrel Calabia,57200600155,57201659474,University of the Philippines - Diliman,University of the Philippines - Diliman,2018,,2-s2.0-85045563943,Theoretical Computer Science; Computer Science...,MATH; COMP,AUID-57200600155,AUID-57201659474
17,Kelvin C. Buño,Henry N. Adorna,57200600155,36573213900,University of the Philippines - Diliman,University of the Philippines - Diliman,2018,,2-s2.0-85045563943,Theoretical Computer Science; Computer Science...,MATH; COMP,AUID-57200600155,AUID-36573213900
18,Francis George C. Cabarle,Marj Darrel Calabia,53983599800,57201659474,University of the Philippines - Diliman,University of the Philippines - Diliman,2018,,2-s2.0-85045563943,Theoretical Computer Science; Computer Science...,MATH; COMP,AUID-53983599800,AUID-57201659474
19,Francis George C. Cabarle,Henry N. Adorna,53983599800,36573213900,University of the Philippines - Diliman,University of the Philippines - Diliman,2018,,2-s2.0-85045563943,Theoretical Computer Science; Computer Science...,MATH; COMP,AUID-53983599800,AUID-36573213900


Above is the "final" author edges table for papers published in areas under the computing field.
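
Since each row of this table is an author pair rather than a paper, the row count is the number of edges. The sketch below shows one way to count distinct papers and authors from such a table. The `edges` frame is a hypothetical miniature that reuses only the `EID`, `AUID 1`, and `AUID 2` column names, with made-up values.

```python
import pandas as pd

# Hypothetical miniature of the author edges table; values are illustrative.
edges = pd.DataFrame({
    'EID': ['eid-1', 'eid-1', 'eid-1', 'eid-2'],
    'AUID 1': ['A', 'A', 'B', 'C'],
    'AUID 2': ['B', 'C', 'C', 'D'],
})

n_edges = len(edges)                    # author pairs (rows)
n_papers = edges['EID'].nunique()       # distinct papers
# Authors can appear on either side of an edge, so pool both columns first
n_authors = pd.concat([edges['AUID 1'], edges['AUID 2']]).nunique()
print(n_edges, n_papers, n_authors)
```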