# GROUP 3 FIRST ATTEMPT
## What is this document?
This document is our group's first attempt at extracting keywords from a given group of documents. For this code, the documents are supplied in a .csv file, which the program reads before processing.

## Setting Up
To begin, place the chosen CSV file in the same folder as this notebook, then change the name between the quotation marks in the `pd.read_csv` call to the name of your file.

For example, if the file's name were Default.CSV, the code should read:

    pd.read_csv('Default.CSV', sep=";", header=None)
    
Unfortunately, the .csv file that this code was written for cannot be uploaded due to confidentiality.

In [8]:
import pandas as pd
df = pd.read_csv("2018_WoS.csv", sep=";", header=None)
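
The loading step above can also be wrapped in a small helper that checks the file is really in the notebook's folder before reading it. This is only a sketch: the helper name `load_documents` is an assumption for illustration and is not part of the original notebook.

```python
from pathlib import Path

import pandas as pd


def load_documents(csv_name):
    """Read a semicolon-separated CSV with no header row (hypothetical helper)."""
    path = Path(csv_name)
    if not path.is_file():
        # Fail early with a readable message instead of a long pandas traceback.
        raise FileNotFoundError(f"{csv_name} was not found (looked from {Path.cwd()})")
    return pd.read_csv(path, sep=";", header=None)
```

With this helper, the cell above would become `df = load_documents("2018_WoS.csv")`.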

## Previewing your data
Below is a preview of the start and end of the data found in the uploaded CSV.

In [2]:
df.head()

| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 405496000001 | COMMUNICATIONS IN NONLINEAR SCIENCE AND NUMERI... | Research on the reliability of friction system... | In this paper, the reliability of a non-linear... |
| 1 | 405496000002 | COMMUNICATIONS IN NONLINEAR SCIENCE AND NUMERI... | Vector solitons in coupled nonlinear Schroding... | The dynamics of two-component solitons is stud... |
| 2 | 405496000003 | COMMUNICATIONS IN NONLINEAR SCIENCE AND NUMERI... | Analysis of cyclical behavior in time series o... | In this paper we have analyzed scaling propert... |
| 3 | 405496000004 | COMMUNICATIONS IN NONLINEAR SCIENCE AND NUMERI... | Impact of marine reserve on maximum sustainabl... | Multispecies fisheries management requires man... |
| 4 | 405496000005 | COMMUNICATIONS IN NONLINEAR SCIENCE AND NUMERI... | Double-well chimeras in 2D lattice of chaotic ... | We investigate spatio-temporal dynamics of a 2... |


In [3]:
df.tail()

| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 1630345 | 501568200001 | TURKISH JOURNAL OF BIOCHEMISTRY-TURK BIYOKIMYA... | TBS International Biochemistry Congress 2018, ... | NaN |
| 1630346 | 502087800001 | JOURNAL OF GASTROINTESTINAL AND LIVER DISEASES | The 5th Romanian-German Symposium of Gastroent... | NaN |
| 1630347 | 502088600001 | JOURNAL OF GASTROINTESTINAL AND LIVER DISEASES | The XXXVIIIth National Congress of Gastroenter... | NaN |
| 1630348 | 502089100001 | JOURNAL OF GASTROINTESTINAL AND LIVER DISEASES | The 10th National Symposium on Inflammatory Bo... | NaN |
| 1630349 | 502648200001 | JOURNAL OF PORPHYRINS AND PHTHALOCYANINES | Electrochemistry of zinc tetraarylporphyrins c... | Two series of zinc tetraarylporphyrins contain... |


## Checking Compatibility
For the program to function correctly, the data must have exactly 4 columns, labelled Publication_Code, Journal_Name, Title and Abstract. The shape of the data is checked below: if the second number is 4, the data has the correct number of columns. If this is not the case, edit the data to add or remove columns as needed.

In [4]:
df.shape

(1630350, 4)
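
The same check can be done programmatically rather than by eye. A minimal sketch, assuming a small toy frame in place of the confidential data; the names `EXPECTED_COLUMNS` and `has_expected_columns` are illustrative and do not appear in the original notebook.

```python
import pandas as pd

EXPECTED_COLUMNS = 4  # Publication_Code, Journal_Name, Title, Abstract


def has_expected_columns(frame, expected=EXPECTED_COLUMNS):
    """Return True when the frame has the expected number of columns."""
    n_rows, n_cols = frame.shape  # shape is (rows, columns)
    return n_cols == expected


# Toy frame standing in for the real data.
toy = pd.DataFrame([[1, "J", "T", "A"], [2, "J", "T", None]])
print(has_expected_columns(toy))  # prints True
```

On the real data, `has_expected_columns(df)` would return `True` for the shape shown above.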

If the number of columns is correct, the next step is the column headings. If they are not already labelled as stated above, run the following code. It applies the correct headings to the data and prints a preview so you can check that the table has been labelled correctly. If the labels end up in the wrong order, rearrange them in the list below to match the order of the columns in your data, e.g. from ['Publication_Code', 'Journal_Name', 'Title', 'Abstract'] to ['Abstract', 'Publication_Code', 'Journal_Name', 'Title'].

In [5]:
df.columns = ['Publication_Code','Journal_Name', 'Title', 'Abstract']
print(df.head())
print(df.tail())

   Publication_Code                                       Journal_Name  \
0      405496000001  COMMUNICATIONS IN NONLINEAR SCIENCE AND NUMERI...   
1      405496000002  COMMUNICATIONS IN NONLINEAR SCIENCE AND NUMERI...   
2      405496000003  COMMUNICATIONS IN NONLINEAR SCIENCE AND NUMERI...   
3      405496000004  COMMUNICATIONS IN NONLINEAR SCIENCE AND NUMERI...   
4      405496000005  COMMUNICATIONS IN NONLINEAR SCIENCE AND NUMERI...   

                                               Title  \
0  Research on the reliability of friction system...   
1  Vector solitons in coupled nonlinear Schroding...   
2  Analysis of cyclical behavior in time series o...   
3  Impact of marine reserve on maximum sustainabl...   
4  Double-well chimeras in 2D lattice of chaotic ...   

                                            Abstract  
0  In this paper, the reliability of a non-linear...  
1  The dynamics of two-component solitons is stud...  
2  In this paper we have analyzed scaling propert... 

## Data Preprocessing Will Be Needed
The preview above shows that some entries are missing (NaN). The code below counts the missing values in each column; the affected rows will have to be removed later during preprocessing.

In [6]:
df.isna().sum()

Publication_Code        0
Journal_Name            0
Title                   0
Abstract            34492
dtype: int64
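
The notebook does not show the removal step at this point. One possible way to do it, sketched here on a made-up toy frame, is `DataFrame.dropna` restricted to the Abstract column; the names `toy` and `cleaned` are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing abstract, standing in for the real data.
toy = pd.DataFrame({
    "Publication_Code": [1, 2, 3],
    "Journal_Name": ["J1", "J2", "J3"],
    "Title": ["T1", "T2", "T3"],
    "Abstract": ["Some text", np.nan, "More text"],
})

# Keep only the rows whose Abstract is present, then renumber the index.
cleaned = toy.dropna(subset=["Abstract"]).reset_index(drop=True)
print(len(toy), "->", len(cleaned))  # prints 3 -> 2
```

Restricting `subset` to Abstract keeps rows that are complete in the column that matters here, which matches the counts above where only Abstract has missing values.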

## Breaking Down the Data
The cell below dissects the data by collecting every distinct journal name into a list and printing it. Given some key words, it then finds the journal names that contain any of them.
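
Before running the matching on the full data set, the idea can be sketched on a short list. This is an illustrative version under stated assumptions: the function name `find_key_journals` and the toy journal list are not part of the original notebook.

```python
def find_key_journals(journal_names, key_words):
    """Return (matched_word, journal_name) pairs, compared case-insensitively."""
    wanted = {word.lower() for word in key_words}
    matches = []
    for name in journal_names:
        for token in name.split():
            # Only whole words count as a match.
            if token.lower() in wanted:
                matches.append((token, name))
    return matches


# Toy list of journal names used only for this example.
toy_journals = [
    "RENEWABLE ENERGY",
    "FOOD CHEMISTRY",
    "JOURNAL OF SUSTAINABLE TOURISM",
]
result = find_key_journals(toy_journals, ["sustainable", "renewable"])
print(result)
```

Because whole words are compared, a key word such as 'sustainable' does not match a journal name containing only 'SUSTAINABILITY'; each spelling of interest has to be listed separately.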


In [7]:
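import numpy as np
import nltk  # not used in this cell

# Draw 50 distinct row positions from the first 1000 rows as a small sample.
chosen_idx = np.random.choice(1000, replace=False, size=50)
sample_df = df.iloc[chosen_idx]

# Collect every distinct journal name in the full data set.
journal_names = df.Journal_Name.unique()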

j_list = journal_names.tolist()
print("All Journal Names")
print(j_list)

key_words = ['sustainable', 'sustainability', 'renewable']

# Each entry is [matched word, journal name split into words].
key_journals = []

for line in j_list:
    tmp = line.split()
    for word in key_words:
        for w in tmp:
            # Compare whole words, ignoring case.
            if w.lower() == word.lower():
                key_journals.append([w, tmp])
print(f'There are {len(key_journals)} journals using key words: {key_words}')

All Journal Names
['COMMUNICATIONS IN NONLINEAR SCIENCE AND NUMERICAL SIMULATION', 'IEEE JOURNAL OF SELECTED TOPICS IN QUANTUM ELECTRONICS', 'JOURNAL OF CIRCUITS SYSTEMS AND COMPUTERS', 'BERNOULLI', 'FOOD CHEMISTRY', 'SCIENCE CHINA-INFORMATION SCIENCES', 'FOOD HYDROCOLLOIDS', 'JOURNAL OF CELLULAR AUTOMATA', 'NURSING HISTORY REVIEW', 'JOURNAL OF APPLIED POLYMER SCIENCE', 'MEASUREMENT', 'RECORDS OF NATURAL PRODUCTS', 'JOURNAL OF CELLULAR PHYSIOLOGY', 'BIOINTERPHASES', 'SCIENCE OF THE TOTAL ENVIRONMENT', 'INTERNATIONAL JOURNAL OF COMPUTATIONAL METHODS', 'COMPUTER SPEECH AND LANGUAGE', 'JOURNAL OF COMBINATORIAL THEORY SERIES A', 'MATHEMATICAL BIOSCIENCES AND ENGINEERING', 'LEUKEMIA & LYMPHOMA', 'JOURNAL OF MATHEMATICAL ANALYSIS AND APPLICATIONS', 'INTERNATIONAL JOURNAL OF AUTOMOTIVE TECHNOLOGY', 'INFORMATION FUSION', '2D MATERIALS', 'JOURNAL OF FOOD ENGINEERING', 'JOURNAL OF HAZARDOUS MATERIALS', 'POSTHARVEST BIOLOGY AND TECHNOLOGY', 'DISCRETE AND CONTINUOUS DYNAMICAL SYSTEMS', 'IEEE TRANS