<a href="https://colab.research.google.com/github/connectasp/NLP_Task8/blob/main/NLP_Task3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Create an application that should be used by the HR Team to filter the resume based on the Skills.**

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
df = pd.read_csv('UpdatedResumeDataSet.csv')
df.head()

Unnamed: 0,Category,Resume
0,Data Science,Skills * Programming Languages: Python (pandas...
1,Data Science,Education Details \r\nMay 2013 to May 2017 B.E...
2,Data Science,"Areas of Interest Deep Learning, Control Syste..."
3,Data Science,Skills â¢ R â¢ Python â¢ SAP HANA â¢ Table...
4,Data Science,"Education Details \r\n MCA YMCAUST, Faridab..."


In [3]:
df.shape

(962, 2)

In [4]:
df['Category'].value_counts()

Java Developer               84
Testing                      70
DevOps Engineer              55
Python Developer             48
Web Designing                45
HR                           44
Hadoop                       42
Blockchain                   40
ETL Developer                40
Operations Manager           40
Data Science                 40
Sales                        40
Mechanical Engineer          40
Arts                         36
Database                     33
Electrical Engineering       30
Health and fitness           30
PMO                          30
Business Analyst             28
DotNet Developer             28
Automation Testing           26
Network Security Engineer    25
SAP Developer                24
Civil Engineer               24
Advocate                     20
Name: Category, dtype: int64

In [5]:
df['Category'].nunique()

25

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 962 entries, 0 to 961
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Category  962 non-null    object
 1   Resume    962 non-null    object
dtypes: object(2)
memory usage: 15.2+ KB


**Data Cleaning & Filtering**

In [17]:
#Here we are using Regular Expression to clean the text 
#removing text lowercase, remove text in square brackets,remove links,remove punctuation and remove words containing numbers.
import re
import string
def clean_text(text):
    text = re.sub('\[.*?\]', '', text)# Removing unwanted special characters
    text = re.sub('<.*?>+', '', text)# Removing symbols and Tags
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)# Removing String Punctuations
    text = re.sub('http\S+\s*', ' ', text) # Removing web Address
    text = re.sub('@S+', '', text)#Removing Mentions
    text = re.sub('#S+', '', text)#Removing Hashtags
    text = re.sub(r'[^\x00-\x7f]',r' ', text)#Removing non ascii values
    text = re.sub('\s+', ' ', text)#Removing Unwanted white spaces
    return text


df['cleaned_resume'] = df['Resume'].apply(lambda x : clean_text(x))

In [18]:
df.head()

Unnamed: 0,Category,Resume,cleaned_resume
0,6,Skills * Programming Languages: Python (pandas...,Skills Programming Languages Python pandas num...
1,6,Education Details \r\nMay 2013 to May 2017 B.E...,Education Details May 2013 to May 2017 BE UITR...
2,6,"Areas of Interest Deep Learning, Control Syste...",Areas of Interest Deep Learning Control System...
3,6,Skills â¢ R â¢ Python â¢ SAP HANA â¢ Table...,Skills R Python SAP HANA Tableau SAP HANA SQL ...
4,6,"Education Details \r\n MCA YMCAUST, Faridab...",Education Details MCA YMCAUST Faridabad Haryan...


**Encoding**

In [19]:
from sklearn.preprocessing import LabelEncoder
category = ['Category']
labelencoder = LabelEncoder()
for i in category:
    df[i] = labelencoder.fit_transform(df[i])
df.head()

Unnamed: 0,Category,Resume,cleaned_resume
0,6,Skills * Programming Languages: Python (pandas...,Skills Programming Languages Python pandas num...
1,6,Education Details \r\nMay 2013 to May 2017 B.E...,Education Details May 2013 to May 2017 BE UITR...
2,6,"Areas of Interest Deep Learning, Control Syste...",Areas of Interest Deep Learning Control System...
3,6,Skills â¢ R â¢ Python â¢ SAP HANA â¢ Table...,Skills R Python SAP HANA Tableau SAP HANA SQL ...
4,6,"Education Details \r\n MCA YMCAUST, Faridab...",Education Details MCA YMCAUST Faridabad Haryan...


**Model Building**

In [20]:
from sklearn.feature_extraction.text import TfidfVectorizer   

In [21]:
TFIDF = TfidfVectorizer(sublinear_tf=True, stop_words='english', max_features=1000)
X = TFIDF.fit_transform(df['cleaned_resume'].values)

In [22]:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

In [23]:
Y = df['Category'].values
x_train,x_test,y_train,y_test = train_test_split(X,Y,test_size=0.3,random_state = 0)
OVR = OneVsRestClassifier(KNeighborsClassifier())
OVR.fit(x_train, y_train)
predicted_value = OVR.predict(x_test)

**Evaluations**

In [24]:
print(classification_report(y_test,predicted_value))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00         4
           1       1.00      1.00      1.00         4
           2       1.00      0.88      0.93         8
           3       1.00      1.00      1.00        15
           4       0.91      1.00      0.95        10
           5       1.00      1.00      1.00        10
           6       0.88      1.00      0.93        14
           7       1.00      0.80      0.89        10
           8       1.00      0.87      0.93        15
           9       1.00      1.00      1.00        10
          10       1.00      1.00      1.00        11
          11       0.93      1.00      0.96        13
          12       1.00      1.00      1.00        12
          13       1.00      1.00      1.00        13
          14       1.00      1.00      1.00         9
          15       0.90      1.00      0.95        26
          16       1.00      1.00      1.00         9
          17       1.00    

**Training Data Accuracy**

In [25]:
print(OVR.score(x_train,y_train))

0.9925705794947994


**Testing Data Accuracy**

In [26]:
print(OVR.score(x_test,y_test))

0.972318339100346
