In [55]:
import pandas as pd
import os

The panel would like you to complete the following task and present it during your interview slot. Attached in the zip file are text files containing metadata from around 5000 journal articles within the food technology sector. Within this data, you will see variables such as 'authors', 'journal', 'year' and 'abstract'. We would like you to import this data, analyse it using your NLP skills, highlight any conclusions or findings you make and display such output on a dashboard. 

Here is some general guidance to help you:
1. We would like to see evidence of your NLP skills, particularly applied to the abstract data
2. The dashboard does not need to be 'polished' and 'neat' as long as it can demonstrate your skill in developing functionality in the dashboard
3. There is a training budget attached to the role, so we are not expecting candidates to be experts in every technical aspect initially.
4. There is no 'right' answer for this task. It is an opportunity for you to not just display your technical skills but to draw whatever conclusions you can


One question has come up that we would like to clarify further. In your data, 'AB' is the main column to analyse, as it contains the contents of the abstracts. This is where your efforts should be focussed.

Other columns of minor interest are 'SO', which contains the journal name, 'TI', which contains the article title and 'AU', which contains the authors of the article.  

The rest is immaterial for this exercise. The main effort should be spent on the abstracts ('AB').

In [56]:
cols = ['AU','TI','PY','AB']

    PT: Publication Type (Column 1)
    DT: Document Type (Column 2)
    AU: Authors (Column 3)
    AA: Author Affiliations (Column 4)
    ED: Editors (Column 5)
    CA: Conference Name (Column 6)
    SP: Conference Sponsor (Column 7)
    PN: Conference Place (Column 8)
    AE: Author Email (Column 9)
    TI: Title (Column 10)
    FT: File Type (Column 11)
    SO: Source (Column 12)
    LA: Language (Column 13)
    LS: Local Subject (Column 14)
    U1: Unknown (Column 15)
    U2: Unknown (Column 16)
    AB: Abstract (Column 17)
    C1: Author Address (Column 18)
    RI: ResearcherID Numbers (Column 19)
    OI: ORCID Numbers (Column 20)
    PA: Publisher Address (Column 21)
    SC: Subject Category (Column 22)
    PI: Publisher (Column 23)
    SS: Special Issue (Column 24)
    ID: Keywords (Column 25)
    CN: Conference Numbers (Column 26)
    PY: Year (Column 27)
    VL: Volume (Column 28)
    IS: Issue (Column 29)
    BP: Beginning Page (Column 30)
    EP: Ending Page (Column 31)
    SN: ISSN (Column 32)
    BN: ISBN (Column 33)
    NR: Number of References (Column 34)
    PG: Page Count (Column 35)
    DI: DOI (Column 36)
    OA: Open Access (Column 37)
    HC: High-Cited (Column 38)
    HP: Hot Paper (Column 39)
    DA: Date Added to Database (Column 40)
    UT: Unique Article Identifier (Column 41)

In [57]:
data = []

directory = 'data/Interview Data/'  


# Get a list of file names in the directory
file_names = [f for f in os.listdir(directory) if os.path.isfile(os.path.join(directory, f))]

for file_name in file_names:
    file_path = os.path.join(directory, file_name)
    with open(file_path, 'r') as file:
        headers = file.readline().strip().split('\t')
        for line in file:
            values = line.strip().split('\t')
            article_data = {}
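            # Rows can have fewer fields than the header (strip() drops
            # trailing empty fields), so only index as far as both lists go
            for i in range(min(len(headers), len(values))):
                article_data[headers[i]] = values[i]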
            data.append(article_data)



In [58]:
data = []

directory = 'data/Interview Data/'  

file_names = [f for f in os.listdir(directory) if os.path.isfile(os.path.join(directory, f))]

for file_name in file_names:
    file_path = os.path.join(directory, file_name)
    
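    with open(file_path, 'r') as file:
        # The first line of each file holds the tab-separated field tags
        headers = file.readline().strip().split('\t')
        
        for line in file:
            values = line.strip().split('\t')
            article_data = {}
            
            # zip stops at the shorter list, so rows with fewer fields than
            # headers do not raise an IndexError
            for header, value in zip(headers, values):
                if header not in article_data:
                    article_data[header] = value
                else:
                    # A repeated header tag: append to the existing value
                    article_data[header] += ' ' + value
            
            data.append(article_data)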
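
The first column label in the DataFrame output (`PT`) appears to carry an invisible leading byte-order mark, which would make `journal_df['PT']` fail. A minimal sketch of one way to avoid that, assuming the files are UTF-8 with a BOM: open them with `encoding='utf-8-sig'` and let the standard `csv` module split the tab-separated fields. The helper name `read_tab_file` and the sample file contents are made up for illustration; they are not part of the original notebook.

```python
import csv
import os
import tempfile


def read_tab_file(path):
    """Read one tab-delimited export into a list of dicts (one per article)."""
    # 'utf-8-sig' drops a leading byte-order mark if present, so the first
    # header comes back as 'PT' rather than '\ufeffPT'.
    with open(path, "r", encoding="utf-8-sig", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        return [dict(row) for row in reader]


# Tiny made-up file standing in for one of the real exports.
sample = "\ufeffPT\tAU\tTI\nJ\tA. Author\tA title\nJ\t\tUntitled\n"
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.txt")
    with open(sample_path, "w", encoding="utf-8") as handle:
        handle.write(sample)
    rows = read_tab_file(sample_path)

print(list(rows[0].keys()))
print(rows[1])
```

Unlike `line.strip().split('\t')`, this keeps empty fields in place as empty strings, so the later "replace blanks with NaN" step still applies.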


In [59]:
journal_df = pd.DataFrame(data)
journal_df.head()

Unnamed: 0,﻿PT,DT,AU,AA,ED,CA,SP,PN,AE,TI,...,SN,BN,NR,PG,DI,OA,HC,HP,DA,UT
0,J,,Yu-Xiao Wang; Yue Xin; Jun-Yi Yin; Xiao-Jun Hu...,,,,,,,Revealing the architecture and solution proper...,...,0308-8146,,,,10.1016/j.foodchem.2021.130772,,,,,FSTA:2022-02-Jq1296
1,J,,Yu-Xiao Wang; Yue Xin; Xiao-Jun Huang; Jun-Yi ...,,,,,,,A branched galactoglucan with flexible chains ...,...,0308-8146,,,,10.1016/j.foodchem.2021.130738,,,,,FSTA:2022-01-Aj0385
2,J,,Yu-Xiao Wang; Ting Zhang; Jun-Yi Yin; Xiao-Jun...,,,,,,,Structural characterization and rheological pr...,...,0268-005X,,,,10.1016/j.foodhyd.2021.107475,,,,,FSTA:2022-05-Jq6208
3,J,,Yu-Xin Gu; Tian-Ci Yan; Zi-Xuan Yue; Min-Hui L...,,,,,,,Dispersive micro-solid-phase extraction of aca...,...,1936-976X,,,,10.1007/s12161-021-02209-8,,,,,FSTA:2023-01-Hq0611
4,J,,Yu-Xue Xu; Ze-Dong Jiang; Xi-Ping Du; Ming-Jin...,,,,,,,The identification of biotransformation pathwa...,...,0308-8146,,,,10.1016/j.foodchem.2022.132103,,,,,FSTA:2022-06-Rg2107


In [60]:
# Replace cells containing only empty values or whitespace with NaN
journal_df = journal_df.replace(r'^\s*$', float('nan'), regex=True)
journal_df.tail(3)

Unnamed: 0,﻿PT,DT,AU,AA,ED,CA,SP,PN,AE,TI,...,SN,BN,NR,PG,DI,OA,HC,HP,DA,UT
4997,J,,Dan He; Liping Yan; Jiaqi Zhang; Fang Li; Yu W...,,,,,,,Sargassum fusiforme polysaccharide attenuates ...,...,2048-7177,,,,10.1002/fsn3.2521,,,,,FSTA:2022-02-Rg0589
4998,J,,Dan Hu; Jinyong Wu; Long Jin; Lixia Yuan; Jun ...,,,,,,,Evaluation of Pediococcus pentosaceus strains ...,...,0963-9969,,,,10.1016/j.foodres.2021.110570,,,,,FSTA:2022-01-Jn0072
4999,J,,Dan Huang; Kaiyang Men; Xiaohong Tang; Wei Li;...,,,,,,,Microwave intermittent drying characteristics ...,...,1745-4530,,,,10.1111/jfpe.13608,,,,,FSTA:2021-04-Ne0967


In [61]:
journal_df.shape

(5000, 41)

In [62]:
# create a dataframe of percentage of null values
null_dict = (dict(journal_df.isna().mean().round(4)*100))
null_df = pd.DataFrame.from_dict(null_dict, orient="index").reset_index()
null_df.columns = ['col', 'percentage']
null_df = null_df.sort_values('percentage',ascending=True)
null_df.head(20)

Unnamed: 0,col,percentage
0,﻿PT,0.0
31,SN,0.0
26,PY,0.0
11,SO,0.0
9,TI,0.0
40,UT,0.0
2,AU,0.0
27,VL,0.04
16,AB,0.12
35,DI,1.46
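
The same null-percentage table can probably be built without the intermediate dictionary. The sketch below runs on a small made-up frame (`toy`) rather than `journal_df`; the column values are invented for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for journal_df: one complete column, one partly
# missing column and one entirely missing column.
toy = pd.DataFrame({
    "AU": ["a", "b", "c", "d"],
    "AB": ["x", np.nan, "y", "z"],
    "ED": [np.nan, np.nan, np.nan, np.nan],
})

# Share of missing values per column, as a percentage, lowest first.
null_pct = toy.isna().mean().mul(100).round(2).sort_values()

# Turn the Series into a two-column frame like null_df.
null_table = null_pct.rename_axis("col").reset_index(name="percentage")
print(null_table)
```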


In [63]:
# Keep only the columns with fewer than 1% missing values
cols = list(null_df[null_df['percentage']<1.0]['col'])
journal_df = journal_df[cols]
journal_df.head()

Unnamed: 0,﻿PT,SN,PY,SO,TI,UT,AU,VL,AB
0,J,0308-8146,2022,Food Chemistry,Revealing the architecture and solution proper...,FSTA:2022-02-Jq1296,Yu-Xiao Wang; Yue Xin; Jun-Yi Yin; Xiao-Jun Hu...,368,Macrolepiota albuminosa (Berk.) Pegler is abun...
1,J,0308-8146,2022,Food Chemistry,A branched galactoglucan with flexible chains ...,FSTA:2022-01-Aj0385,Yu-Xiao Wang; Yue Xin; Xiao-Jun Huang; Jun-Yi ...,367,A homogeneous galactoglucan was purified from ...
2,J,0268-005X,2022,Food Hydrocolloids,Structural characterization and rheological pr...,FSTA:2022-05-Jq6208,Yu-Xiao Wang; Ting Zhang; Jun-Yi Yin; Xiao-Jun...,126,A homogeneous beta-glucan (JHMP-70) was obtain...
3,J,1936-976X,2022,Food Analytical Methods,Dispersive micro-solid-phase extraction of aca...,FSTA:2023-01-Hq0611,Yu-Xin Gu; Tian-Ci Yan; Zi-Xuan Yue; Min-Hui L...,15,A novel dispersive micro-solid-phase extractio...
4,J,0308-8146,2022,Food Chemistry,The identification of biotransformation pathwa...,FSTA:2022-06-Rg2107,Yu-Xue Xu; Ze-Dong Jiang; Xi-Ping Du; Ming-Jin...,380,The yeast Saccharomyces cerevisiae is effectiv...


In [64]:
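# Give the columns of interest readable names. Reassigning (rather than
# inplace=True) avoids modifying a slice of the earlier DataFrame.
journal_df = journal_df.rename(columns={'AU':'authors',
                                        'PY':'year',
                                        'TI':'title',
                                        'AB':'abstract',
                                        'SO':'journal_name'})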

journal_df.head()

Unnamed: 0,﻿PT,SN,year,journal_name,title,UT,authors,VL,abstract
0,J,0308-8146,2022,Food Chemistry,Revealing the architecture and solution proper...,FSTA:2022-02-Jq1296,Yu-Xiao Wang; Yue Xin; Jun-Yi Yin; Xiao-Jun Hu...,368,Macrolepiota albuminosa (Berk.) Pegler is abun...
1,J,0308-8146,2022,Food Chemistry,A branched galactoglucan with flexible chains ...,FSTA:2022-01-Aj0385,Yu-Xiao Wang; Yue Xin; Xiao-Jun Huang; Jun-Yi ...,367,A homogeneous galactoglucan was purified from ...
2,J,0268-005X,2022,Food Hydrocolloids,Structural characterization and rheological pr...,FSTA:2022-05-Jq6208,Yu-Xiao Wang; Ting Zhang; Jun-Yi Yin; Xiao-Jun...,126,A homogeneous beta-glucan (JHMP-70) was obtain...
3,J,1936-976X,2022,Food Analytical Methods,Dispersive micro-solid-phase extraction of aca...,FSTA:2023-01-Hq0611,Yu-Xin Gu; Tian-Ci Yan; Zi-Xuan Yue; Min-Hui L...,15,A novel dispersive micro-solid-phase extractio...
4,J,0308-8146,2022,Food Chemistry,The identification of biotransformation pathwa...,FSTA:2022-06-Rg2107,Yu-Xue Xu; Ze-Dong Jiang; Xi-Ping Du; Ming-Jin...,380,The yeast Saccharomyces cerevisiae is effectiv...


In [65]:
imp_cols = ['authors','title','abstract','journal_name']
journal_df = journal_df[imp_cols]
journal_df.head()


Unnamed: 0,authors,title,abstract,journal_name
0,Yu-Xiao Wang; Yue Xin; Jun-Yi Yin; Xiao-Jun Hu...,Revealing the architecture and solution proper...,Macrolepiota albuminosa (Berk.) Pegler is abun...,Food Chemistry
1,Yu-Xiao Wang; Yue Xin; Xiao-Jun Huang; Jun-Yi ...,A branched galactoglucan with flexible chains ...,A homogeneous galactoglucan was purified from ...,Food Chemistry
2,Yu-Xiao Wang; Ting Zhang; Jun-Yi Yin; Xiao-Jun...,Structural characterization and rheological pr...,A homogeneous beta-glucan (JHMP-70) was obtain...,Food Hydrocolloids
3,Yu-Xin Gu; Tian-Ci Yan; Zi-Xuan Yue; Min-Hui L...,Dispersive micro-solid-phase extraction of aca...,A novel dispersive micro-solid-phase extractio...,Food Analytical Methods
4,Yu-Xue Xu; Ze-Dong Jiang; Xi-Ping Du; Ming-Jin...,The identification of biotransformation pathwa...,The yeast Saccharomyces cerevisiae is effectiv...,Food Chemistry


In [66]:
journal_df.shape

(5000, 4)
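
The null table above reports 0.12% missing values for 'AB', which is about 6 of the 5000 rows. Since the abstracts are the focus of the NLP work, it may be worth dropping those rows before going further. A minimal sketch on a made-up frame (`toy`), not the notebook's actual data:

```python
import numpy as np
import pandas as pd

# Made-up stand-in for journal_df with one missing abstract.
toy = pd.DataFrame({
    "title": ["t1", "t2", "t3"],
    "abstract": ["Some text.", np.nan, "More text."],
})

# Count the rows that have no abstract, then drop them.
missing = int(toy["abstract"].isna().sum())
with_abstract = toy.dropna(subset=["abstract"]).reset_index(drop=True)

print(missing, with_abstract.shape)
```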

In [67]:
journal_df.columns

Index(['authors', 'title', 'abstract', 'journal_name'], dtype='object')

In [68]:
journal_df.iloc[1351]

authors                                Kononoff, P.; Stelwage, K.
title           Abstracts of the 2021 American Dairy Science A...
abstract        This supplement includes abstracts for oral an...
journal_name                             Journal of Dairy Science
Name: 1351, dtype: object

In [72]:
journal_df.to_csv('data/journals.csv',index=False)

In [73]:
journal_df['journal_name'].nunique()

57

In [74]:
journal_df['title'].nunique()

4998
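
With 5000 rows but only 4998 unique titles, at least some titles occur more than once. One way to inspect them is `DataFrame.duplicated` with `keep=False`, which flags every row in a repeated group rather than only the later copies. The sketch uses a made-up frame (`toy`); on the real data the same call would be applied to `journal_df`.

```python
import pandas as pd

# Made-up stand-in for journal_df: title "A" appears twice.
toy = pd.DataFrame({
    "title": ["A", "B", "A", "C"],
    "journal_name": ["J1", "J1", "J2", "J3"],
})

# keep=False marks all members of each duplicated group.
dupes = toy[toy.duplicated(subset="title", keep=False)].sort_values("title")
print(dupes)
```

Whether such rows are true duplicates or distinct articles sharing a title would need a look at the other columns, such as the journal name.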