
In [108]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [109]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Data Loading

In [110]:
# Load the article, author, and yearly paper-count tables

articles = pd.read_csv('/content/drive/MyDrive/Group 1/articles.CDKN2A.csv')
authors = pd.read_csv('/content/drive/MyDrive/Group 1/authors.CDKN2A.csv')
paper_counts = pd.read_csv('/content/drive/MyDrive/Group 1/paper_counts.csv')

In [111]:
articles.head()

Unnamed: 0,PMID,Title,Abstract,ISSN,Journal,Location,Year,FirstAuthorForename,FirstAuthorLastname,FirstAuthorInitials,FirstAuthorAffiliation
0,10551774,Transfection of an inducible p16/CDKN2A constr...,Recent studies have shown that methylation of ...,0888-8809,"Molecular endocrinology (Baltimore, Md.)",(13) 1801-10,1999,S J,Frost,SJ,"Centre for Cell and Molecular Medicine, School..."
1,10595918,Malignant transformation of neurofibromas in n...,Patients with neurofibromatosis 1 (NF1) are pr...,0002-9440,The American journal of pathology,(155) 1879-84,1999,G P,Nielsen,GP,Molecular Neuro-Oncology Laboratory and the Ja...
2,10620111,Genotype/phenotype and penetrance studies in m...,Patients with a family history of melanoma are...,0022-202X,The Journal of investigative dermatology,(114) 28-33,2000,J A,Bishop,JA,"ICRF Genetic Epidemiology Laboratory, Leeds, U..."
3,10630172,The genetics of hereditary melanoma and nevi. ...,Although the first English-language report of ...,0008-543X,Cancer,(86) 2464-77,1999,M H,Greene,MH,"Division of Hematology/Oncology, Mayo Clinic S..."
4,10632344,Analysis of oncogene and tumor suppressor gene...,Although common among adult intracranial neopl...,1078-0432,Clinical cancer research : an official journal...,(5) 4085-90,1999,C,Raffel,C,"Department of Neurosurgery, Mayo Clinic and Fo..."


In [112]:
authors.head()

Unnamed: 0,PMID,AuthorN,AuthorForename,AuthorLastname,AuthorInitials,AuthorAffiliation
0,10551774,1,S J,Frost,SJ,"Centre for Cell and Molecular Medicine, School..."
1,10551774,2,D J,Simpson,DJ,
2,10551774,3,R N,Clayton,RN,
3,10551774,4,W E,Farrell,WE,
4,10595918,1,G P,Nielsen,GP,Molecular Neuro-Oncology Laboratory and the Ja...


In [113]:
paper_counts.head()

Unnamed: 0,Year,Count
0,1799,1
1,1801,1
2,1802,1
3,1805,1
4,1866,1


### Data Cleaning

In [114]:
# Data Cleaning: Handle Missing Values and Remove Duplicates
# Handle missing values in `authors`
# Assign the result back rather than calling fillna(inplace=True) on a column
# selection, which pandas warns will stop updating the original DataFrame in 3.0
authors['AuthorForename'] = authors['AuthorForename'].fillna('Unknown')
authors['AuthorInitials'] = authors['AuthorInitials'].fillna('Unknown')
authors.dropna(subset=['AuthorLastname'], inplace=True)


In [115]:
# Handle missing values in `articles`
articles.fillna({
    'Location': 'Unknown',
    'FirstAuthorForename': 'Unknown',
    'FirstAuthorLastname': 'Unknown',
    'FirstAuthorInitials': 'Unknown',
    'FirstAuthorAffiliation': 'Unknown'
}, inplace=True)

In [116]:
# Remove duplicates
articles.drop_duplicates(inplace=True)
authors.drop_duplicates(inplace=True)
paper_counts.drop_duplicates(inplace=True)

In [117]:
# Check for remaining missing values in each table
print(articles.isnull().sum())
print(authors.isnull().sum())
print(paper_counts.isnull().sum())

PMID                      0
Title                     0
Abstract                  0
ISSN                      0
Journal                   0
Location                  0
Year                      0
FirstAuthorForename       0
FirstAuthorLastname       0
FirstAuthorInitials       0
FirstAuthorAffiliation    0
dtype: int64
PMID                     0
AuthorN                  0
AuthorForename           0
AuthorLastname           0
AuthorInitials           0
AuthorAffiliation    13123
dtype: int64
Year     0
Count    0
dtype: int64


In [118]:
# Data Manipulation: Merge and Create Unique Identifiers
# Merge `authors` and `articles` on `PMID`
merged_df = pd.merge(authors, articles, on='PMID', how='inner')

In [119]:
# Create an identifier for researchers from last name and initials
# (not guaranteed unique: different people can share both)
merged_df['ResearcherID'] = merged_df['AuthorLastname'] + '_' + merged_df['AuthorInitials']
merged_df.head()

Unnamed: 0,PMID,AuthorN,AuthorForename,AuthorLastname,AuthorInitials,AuthorAffiliation,Title,Abstract,ISSN,Journal,Location,Year,FirstAuthorForename,FirstAuthorLastname,FirstAuthorInitials,FirstAuthorAffiliation,ResearcherID
0,10551774,1,S J,Frost,SJ,"Centre for Cell and Molecular Medicine, School...",Transfection of an inducible p16/CDKN2A constr...,Recent studies have shown that methylation of ...,0888-8809,"Molecular endocrinology (Baltimore, Md.)",(13) 1801-10,1999,S J,Frost,SJ,"Centre for Cell and Molecular Medicine, School...",Frost_SJ
1,10551774,2,D J,Simpson,DJ,,Transfection of an inducible p16/CDKN2A constr...,Recent studies have shown that methylation of ...,0888-8809,"Molecular endocrinology (Baltimore, Md.)",(13) 1801-10,1999,S J,Frost,SJ,"Centre for Cell and Molecular Medicine, School...",Simpson_DJ
2,10551774,3,R N,Clayton,RN,,Transfection of an inducible p16/CDKN2A constr...,Recent studies have shown that methylation of ...,0888-8809,"Molecular endocrinology (Baltimore, Md.)",(13) 1801-10,1999,S J,Frost,SJ,"Centre for Cell and Molecular Medicine, School...",Clayton_RN
3,10551774,4,W E,Farrell,WE,,Transfection of an inducible p16/CDKN2A constr...,Recent studies have shown that methylation of ...,0888-8809,"Molecular endocrinology (Baltimore, Md.)",(13) 1801-10,1999,S J,Frost,SJ,"Centre for Cell and Molecular Medicine, School...",Farrell_WE
4,10595918,1,G P,Nielsen,GP,Molecular Neuro-Oncology Laboratory and the Ja...,Malignant transformation of neurofibromas in n...,Patients with neurofibromatosis 1 (NF1) are pr...,0002-9440,The American journal of pathology,(155) 1879-84,1999,G P,Nielsen,GP,Molecular Neuro-Oncology Laboratory and the Ja...,Nielsen_GP
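
Because `ResearcherID` joins only the last name and initials, distinct people who share both can collapse into a single ID. The following is a minimal sketch of how such IDs might be flagged. It uses a small hypothetical frame `toy` in place of `merged_df`, and its rows are invented for illustration only.

```python
import pandas as pd

# Hypothetical toy rows standing in for merged_df (values invented for illustration)
toy = pd.DataFrame({
    'AuthorForename': ['Yan', 'Yu', 'S J', 'S J'],
    'AuthorLastname': ['Wang', 'Wang', 'Frost', 'Frost'],
    'AuthorInitials': ['Y', 'Y', 'SJ', 'SJ'],
})
toy['ResearcherID'] = toy['AuthorLastname'] + '_' + toy['AuthorInitials']

# Count distinct forenames behind each ID; more than one hints at a possible collision
forename_variants = toy.groupby('ResearcherID')['AuthorForename'].nunique()
ambiguous_ids = forename_variants[forename_variants > 1].index.tolist()
print(ambiguous_ids)
```

This check is only a heuristic: one person whose forename is recorded in different forms (full name in one paper, initials in another) would also be flagged, so flagged IDs need manual review.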


In [120]:
# Keep only the author rows whose paper abstract mentions 'cancer' (case-insensitive)
cancer_rows = merged_df[merged_df['Abstract'].str.contains('cancer', case=False, na=False)]

print(cancer_rows)

cancer_count = cancer_rows.shape[0]
print(f"Number of rows where 'Abstract' contains the word 'cancer': {cancer_count}")

           PMID  AuthorN AuthorForename AuthorLastname AuthorInitials  \
19     10630172        1            M H         Greene             MH   
27     10644453        1            B G         Beatty             BG   
28     10644453        2              S             Qi              S   
29     10644453        3              M     Pienkowska              M   
30     10644453        4            J A       Herbrick             JA   
...         ...      ...            ...            ...            ...   
35705  38070141        3        Xiaoxue           Yuan              X   
35706  38070141        4         Mengqi            Liu              M   
35707  38070141        5           Ping             Wu              P   
35708  38070141        6             Li          Zhong              L   
35709  38070141        7        Zhiyong           Chen              Z   

                                       AuthorAffiliation  \
19     Division of Hematology/Oncology, Mayo Clinic S...   
27 

### Most active researchers

In [121]:
# Count the number of papers each researcher has been involved in
author_counts = cancer_rows['ResearcherID'].value_counts().reset_index()
author_counts.columns = ['ResearcherID', 'Count']
# Sort into descending order
author_counts = author_counts.sort_values(by='Count', ascending=False)
print(author_counts.head(10))

   ResearcherID  Count
0        Wang_Y     27
1      Fuchs_CS     26
2       Ogino_S     25
3     Hruban_RH     23
4        Wang_J     22
5  Goldstein_AM     21
6       Zhang_Y     20
9    Kirkner_GJ     19
8       Zhang_J     19
7          Li_J     19
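
`value_counts` counts rows, so it equals the number of papers only if each researcher appears once per paper. As a hedged alternative, the sketch below counts distinct `PMID` values per researcher. The frame `toy_rows` is a hypothetical stand-in for `cancer_rows`, with invented values.

```python
import pandas as pd

# Hypothetical toy rows standing in for cancer_rows (values invented for illustration)
toy_rows = pd.DataFrame({
    'PMID': [1, 1, 2, 3],
    'ResearcherID': ['A_X', 'A_X', 'A_X', 'B_Y'],
})

# nunique counts each paper once per researcher, even if an author row is repeated
papers_per_author = (
    toy_rows.groupby('ResearcherID')['PMID']
    .nunique()
    .sort_values(ascending=False, kind='stable')
    .rename('Count')
    .reset_index()
)
print(papers_per_author)
```

Passing `kind='stable'` keeps tied researchers in their original order, which makes the ranking reproducible between runs.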


### Collaboration Patterns
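
The cell below relies on `most_published_authors` and `collaboration`, whose defining cell is not shown here. The following is a minimal sketch of one way they could be built. It assumes that `most_published_authors` is the ranked count table and that `collaboration` holds one row per `PMID` with a comma-separated `Collaborators` string. The toy rows are invented, and the real definitions may differ.

```python
import pandas as pd

# Hypothetical toy rows standing in for cancer_rows (values invented for illustration)
toy_rows = pd.DataFrame({
    'PMID': [1, 1, 2, 2, 2],
    'ResearcherID': ['A_X', 'B_Y', 'A_X', 'B_Y', 'C_Z'],
})

# Assumed shape: ranked table of researchers and their row counts
most_published_authors = (
    toy_rows['ResearcherID'].value_counts()
    .rename_axis('ResearcherID')
    .reset_index(name='Count')
)

# Assumed shape: one row per paper, collaborators joined with commas (no spaces)
collaboration = (
    toy_rows.groupby('PMID')['ResearcherID']
    .agg(','.join)
    .rename('Collaborators')
    .reset_index()
)
print(collaboration)
```

Joining without spaces matters here because the loop splits on `','` and compares the pieces to `ResearcherID` values exactly.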

In [123]:
from itertools import combinations

# Create empty lists to collect records; they are converted to DataFrames at the end
df_coauthored = []
df_no_coauthored = []

# Get the researcherID of the most published authors
top_10_authors = most_published_authors['ResearcherID'][:10]

# Create a set of pairs of researchers to check if they have collaborated
author_pairs = list(combinations(top_10_authors, 2))

# Create a dictionary to store the pair and the papers they've co-authored
coauthored_together = {pair: set() for pair in author_pairs}

paper_to_collaborators = collaboration.set_index('PMID')['Collaborators'].to_dict()

# For each pair of top 10 authors, check if they have co-authored any paper
for paper_id, collaborators in paper_to_collaborators.items():
    collaborators = collaborators.split(',')
    # Convert to a set
    collaborator_set = set(collaborators)
    # Check each pair of top 10 authors
    for author1, author2 in author_pairs:
        # If both authors are in the set of collaborators for this paper, they have co-authored it
        if author1 in collaborator_set and author2 in collaborator_set:
            coauthored_together[(author1, author2)].add(paper_id)

# Print out the pairs and the papers they've co-authored
for (author1, author2), papers in coauthored_together.items():
    # If there are co-authored papers, print the authors and papers
    if len(papers) > 0:
        print(f'{author1} and {author2} have co-authored {len(papers)} papers: {papers}')
        # Add to the list of co-authoring pairs
        df_coauthored.append({'Author 1': author1,
                              'Author 2': author2,
                              'Co-authored Papers': list(papers)})
    else:
        # No co-authored papers for this pair
        print(f'{author1} and {author2} have not co-authored any papers.')
        # Add to the list of pairs with no shared papers
        df_no_coauthored.append({'Author 1': author1,
                                 'Author 2': author2,
                                 'Co-authored Papers': 0})

df_coauthored = pd.DataFrame(df_coauthored)
df_no_coauthored = pd.DataFrame(df_no_coauthored)


Wang_Y and Fuchs_CS have not co-authored any papers.
Wang_Y and Ogino_S have not co-authored any papers.
Wang_Y and Hruban_RH have co-authored 1 papers: {26658419}
Wang_Y and Wang_J have co-authored 2 papers: {22510280, 29667179}
Wang_Y and Goldstein_AM have not co-authored any papers.
Wang_Y and Zhang_Y have co-authored 2 papers: {22208613, 28926119}
Wang_Y and Kirkner_GJ have not co-authored any papers.
Wang_Y and Zhang_J have co-authored 2 papers: {29667179, 22815924}
Wang_Y and Li_J have not co-authored any papers.
Fuchs_CS and Ogino_S have co-authored 24 papers: {18084616, 17350669, 18204436, 19430421, 19002263, 18084250, 21037082, 19107235, 19789368, 20473920, 18516290, 16850502, 17710160, 17065427, 17372756, 17621591, 16645207, 17086168, 17270239, 16699497, 18366060, 17591929, 17239930, 16820091}
Fuchs_CS and Hruban_RH have co-authored 1 papers: {27197284}
Fuchs_CS and Wang_J have not co-authored any papers.
Fuchs_CS and Goldstein_AM have not co-authored any papers.
Fuchs_CS and

In [124]:
# List of authors who have worked together and the papers they have co-authored
df_coauthored

Unnamed: 0,Author 1,Author 2,Co-authored Papers
0,Wang_Y,Hruban_RH,[26658419]
1,Wang_Y,Wang_J,"[22510280, 29667179]"
2,Wang_Y,Zhang_Y,"[22208613, 28926119]"
3,Wang_Y,Zhang_J,"[29667179, 22815924]"
4,Fuchs_CS,Ogino_S,"[18084616, 17350669, 18204436, 19430421, 19002..."
5,Fuchs_CS,Hruban_RH,[27197284]
6,Fuchs_CS,Kirkner_GJ,"[18084616, 17350669, 18204436, 19002263, 18084..."
7,Ogino_S,Kirkner_GJ,"[18084616, 17350669, 18204436, 19002263, 18084..."
8,Hruban_RH,Wang_J,"[28272465, 21798897]"
9,Hruban_RH,Zhang_J,[21798897]
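
The table stores each pair's shared papers as a list, which makes the strongest collaborations hard to see at a glance. The sketch below shows one way to derive a count column and rank the pairs by it. The frame `toy_coauthored` is a hypothetical stand-in for `df_coauthored` with invented paper IDs, and the column name `n_papers` is an assumption.

```python
import pandas as pd

# Hypothetical stand-in for df_coauthored (paper IDs invented for illustration)
toy_coauthored = pd.DataFrame({
    'Author 1': ['A_X', 'A_X', 'B_Y'],
    'Author 2': ['B_Y', 'C_Z', 'C_Z'],
    'Co-authored Papers': [[101], [101, 102, 103], [102, 103]],
})

# Derive a numeric count from each list, then rank pairs by it
ranked = (
    toy_coauthored
    .assign(n_papers=toy_coauthored['Co-authored Papers'].map(len))
    .sort_values('n_papers', ascending=False, kind='stable')
    .reset_index(drop=True)
)
print(ranked[['Author 1', 'Author 2', 'n_papers']])
```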


In [125]:
# Pairs of top authors who have not co-authored any papers
df_no_coauthored

Unnamed: 0,Author 1,Author 2,Co-authored Papers
0,Wang_Y,Fuchs_CS,0
1,Wang_Y,Ogino_S,0
2,Wang_Y,Goldstein_AM,0
3,Wang_Y,Kirkner_GJ,0
4,Wang_Y,Li_J,0
5,Fuchs_CS,Wang_J,0
6,Fuchs_CS,Goldstein_AM,0
7,Fuchs_CS,Zhang_Y,0
8,Fuchs_CS,Zhang_J,0
9,Fuchs_CS,Li_J,0
