**Notes to get keywords and use natural language processing**

original application is for skills gap app

**This is in the base python environment.**

In [1]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


In [2]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
import re

In [3]:
import nltk

# Overview of python tools

My objective is an application for the skills gap app but as I'm learning more about the tools, I'm seeing that each serves different purposes. Here are notes for each.


- Keywords: scikit-learn's CountVectorizer with TfidfTransformer or TfidfVectorizer
- Natural language: NLTK and/or TextBlob
- Word embedding: Word2vec (mapping of words into vectors of real numbers; apple, mango, banana should be placed close whereas books will be far away from these words).
- Get better at regular expressions


I think my closest application would be natural language but these notes will evolve as I learn more. Per [this post](https://www.quora.com/What-is-the-use-of-NLTK-and-TextBlob-What-is-the-difference-between-both-And-for-text-analysis-which-tool-is-better), there was this quote:

>"If you are new to this field I would advise you to start with NLTK and learn all its concepts which will build up all the basics required in this domain followed by moving towards Sentiment Analysis, Text Classification, Speech Recognition and Question Answering using TextBlob, Scikit-learn, Spacy and Stanford-OpenIE."

Note that the DataSchool [tutorial](https://www.dataschool.io/learn/) does natural language with the scikit-learn packages.

# Inputs

## Resume

In [3]:
import PyPDF2
import docx

ModuleNotFoundError: No module named 'PyPDF2'

In [3]:
# Word document of resume
my_resume = '/Users/lacar/Documents/Goals_and_careers/Education Data Science/Udemy/BL_Resume_UDMY.docx'
my_resume_docx_k = '/Users/lacar/Desktop/Kathleen/Lacar_Resume_Research_2018.docx'

my_resume_pdf = '/Users/lacar/Documents/Goals_and_careers/Education Data Science/Udemy/BL_Resume_UDMY.pdf'
 
# Link to target JD (or saved html)
target_jd = 'https://jobs.lever.co/udemy/6b2e3401-bbd9-48f6-b6d8-e1fa19d801a7'
# target_jd = '/Users/lacar/Documents/Goals_and_careers/Education Data Science/Udemy/Udemy\ -\ Data Scientist\ -\ Insights.html'

In [4]:
# Opening resume with docx
# https://stackoverflow.com/questions/25228106/how-to-extract-text-from-an-existing-docx-file-using-python-docx
def getText(filename):
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)

resume_text_frdocx = getText(my_resume)
# Cleanup new line, tab, return characters
resume_text_frdocx = resume_text_frdocx.replace('\n', ' ').replace('\uf09f','').replace('\uf020','').replace('\t', '')

resume_text_frdocx_k = getText(my_resume_docx_k)
# Cleanup new line, tab, return characters
resume_text_frdocx_k = resume_text_frdocx_k.replace('\n', ' ').replace('\uf09f','').replace('\uf020','').replace('\t', '')



In [5]:
# Opening resume with pypdf2
# https://medium.com/@rqaiserr/how-to-convert-pdfs-into-searchable-key-words-with-python-85aab86c544f

def return_text_fr_pdf(resume_as_pdf):
    resume_pdf = open(resume_as_pdf, 'rb')
    resume_pdf_reader = PyPDF2.PdfFileReader(resume_pdf)  # readable object
    num_pages = resume_pdf_reader.numPages
    count = 0
    resume_text_frpdf = ''
    # Use while loop to read each page
    while count < num_pages:
        page_obj = resume_pdf_reader.getPage(count)
        count +=1
        resume_text_frpdf += page_obj.extractText()
    return resume_text_frpdf

In [6]:
resume_text_frpdf = return_text_fr_pdf(my_resume_pdf)

In [7]:
len(resume_text_frdocx)

4435

In [8]:
def return_random_section_of_text(text, string_length):
    start, end = 0, 0
    while (end + string_length) < len(text):
        start = int(round((np.random.random(1)*len(text))[0]))
        end = start + string_length
        substring_text = text[start:end]
    return substring_text

In [9]:
return_random_section_of_text(resume_text_frdocx, 1000)

'of Learning Center Trainee Co-Chair (2014-15) James S. McDonnell Foundation Postdoctoral Fellowship (2014-15) Neuroplasticity of Aging Postdoctoral Training Grant (2012-13)  NSF Graduate Research Fellowship (14% applicant success rate) (2006-09) UC Education Abroad Program Scholarship – Sweden (2001-02) UCLA Provost’s Honors List (1998-2003)   Volunteer Work and Activities                Provide guest lectures for UCSD Extension Stem Cell Biology Served as Neuroscience Consultant to Educators on UCSD Distinguished Educator Panel Toastmasters New Haven Hill Neighborhood Mentoring Program'

In [10]:
return_random_section_of_text(resume_text_frdocx_k, 1000)

'pa) Member of Psi Chi, National Honor Society in Psychology Howard D. Baker Undergraduate Research Award, third place Dean’s List, 2011-2012; 2002-2006 President’s List, 2011-2012; 2002-2006  CREDENTIALS/CERTIFICATIONS California State Registered Nurse  CCRN (Adult) certified through the American Association of Critical-Care Nurses ACLS and BLS certified through the American Heart Association  PALS certified through the American Heart Association NIH Stroke Scale Certified Data Science in Stratified Healthcare and Precision Medicine (The University of Edinburgh via Coursera)  '

## Opening JD html

In [11]:
import urllib.request
from bs4 import BeautifulSoup
import re

In [12]:
url = target_jd
document = urllib.request.urlopen(url).read().decode()

html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'lxml')
body = soup.body
jd_text = body.get_text(separator='\n')

In [13]:
jd_text = jd_text.replace('\n', ' ').replace('\\n', ' ').replace('\xa0', ' ')

In [15]:
return_random_section_of_text(jd_text, 1000)

'oration, and we share a serious belief in the power of learning and teaching to change lives. Udemy’s culture encourages innovation, creativity, passion, and teamwork. We also celebrate our milestones and support each other every day.  Founded in 2010, Udemy is privately owned and headquartered in San Francisco’s SOMA neighborhood with offices in Denver (Colorado), Dublin (Ireland), Ankara (Turkey), and São Paulo (Brazil).    Udemy in the News: The Key To Solving Future Skills Challenges Algorithms are coming for their jobs, so workers are teaching themselves algorithms Distractions Are Costing Companies Millions. Here\'s Why 66 Percent of Workers Won\'t Talk About It How Soft Skills Can Help You Get Ahead in a Tech World"} var gaCode = "UA-114911611-1"; var gaAllowLinker = false; (function(i,s,o,g,r,a,m){i[\'GoogleAnalyticsObject\']=r;i[r]=i[r]||function(){(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.

# Keyword text extraction

- CountVectorizer followed by TF-IDF transformer [(tutorial)](https://www.freecodecamp.org/news/how-to-extract-keywords-from-text-with-tf-idf-and-pythons-scikit-learn-b2a0f3d7e667/)
    - TF = term frequency, number of times a term appears in a document - *note that CountVectorizer only does this*
    - IDF = log(total number of documents / number of documents that contain a term)
    - The principle of TF-DF is that a document is compared against a lot of other documents to see what's special about it; **it can't get keywords in a single document alone**

- TF-IDF *vectorizer* (equivalent to first option) [scikit-learn documentation)](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)

- RAKE [(tutorial](https://www.airpair.com/nlp/keyword-extraction-tutorial), [documentation)](https://pypi.org/project/rake-nltk/)



In [18]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
import re

# Quantify with CountVectorizer?

In [22]:
# example text
skills = ['storytelling', 'data visualization', 'applied statistics', 'SQL']

In [23]:
vect = CountVectorizer()

In [24]:
vect.fit(skills)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [29]:
# Get fitted vocabulary
vect.get_feature_names()

['applied', 'data', 'sql', 'statistics', 'storytelling', 'visualization']

In [27]:
# Transform training data into a 'document-term matrix'
simple_train_dtm = vect.transform(skills)
simple_train_dtm

<4x6 sparse matrix of type '<class 'numpy.int64'>'
	with 6 stored elements in Compressed Sparse Row format>

In [30]:
# This converts sparse matrix to dense matrix
simple_train_dtm.toarray()

# Note the size of the matrix

array([[0, 0, 0, 0, 1, 0],
       [0, 1, 0, 0, 0, 1],
       [1, 0, 0, 1, 0, 0],
       [0, 0, 1, 0, 0, 0]])

In [33]:
# Put into a dataframe to examine vocab and matrix together
pd.DataFrame(simple_train_dtm.toarray(), columns=vect.get_feature_names(), index=skills)

Unnamed: 0,applied,data,sql,statistics,storytelling,visualization
storytelling,0,0,0,0,1,0
data visualization,0,1,0,0,0,1
applied statistics,1,0,0,1,0,0
SQL,0,0,1,0,0,0


Each row is a different document

In [None]:
# Note the parameters that are being used from the fit (does in place)

In [10]:
doc_example1 = 'We’re Excited About You Because You Have • 3+ years of relevant experience in a combination of business/analytics roles (e.g. data science/analytics, management consulting, or business operations) • Ability to communicate effectively with non-technical stakeholders • Strong business intuition and judgement • Expert level data storytelling and data visualization ability • Broad knowledge of applied statistics, experimental design, and causal inference • Expert SQL ability and proficiency with 1+ programming languages (e.g., R or Python) • Experience with BI tools (e.g. Looker, Chartio, or Tableau)'
doc_example2 = 'I like baseball'

In [11]:
vect.decode(doc_example1)

'We’re Excited About You Because You Have • 3+ years of relevant experience in a combination of business/analytics roles (e.g. data science/analytics, management consulting, or business operations) • Ability to communicate effectively with non-technical stakeholders • Strong business intuition and judgement • Expert level data storytelling and data visualization ability • Broad knowledge of applied statistics, experimental design, and causal inference • Expert SQL ability and proficiency with 1+ programming languages (e.g., R or Python) • Experience with BI tools (e.g. Looker, Chartio, or Tableau)'

In [12]:
vect.decode(doc_example2)

'I like baseball'

## TF-IDF vectorizer with simple example

In [136]:
# Test cell, following this nice tutorial https://www.youtube.com/watch?v=4vT4fzjkGCQ
d1 = 'the sky is blue'
d2 = 'the sky is not blue'
df_test = pd.DataFrame()
df_test['text'] = [d1, d2]

In [137]:
df_test

Unnamed: 0,text
0,the sky is blue
1,the sky is not blue


In [138]:
tv = TfidfVectorizer(max_df=0.99, stop_words=['the'], use_idf=True)   # note that use_idf by default is False
X = tv.fit_transform(df_test['text'])

### Return values of fit

In [139]:
print(tv.get_feature_names())

['not']


In [140]:
# inverse document requency vector
tv.idf_

array([1.40546511])

In [141]:
tv.vocabulary_

{'not': 0}

In [142]:
tv.stop_words_

{'blue', 'is', 'sky'}

## Resume section

In [143]:
# Resume keywords
df_res = pd.DataFrame()
df_res['text'] = [resume_text_frdocx, resume_text_frdocx_k, jd_text]

In [144]:
df_res.head()

Unnamed: 0,text
0,"BENJAMIN LACAR, Ph.D. 619.419.6227 ben.lacar@..."
1,"KATHLEEN LACAR BSN, RN, CCRN kmuller@gmail.com..."
2,"Data Scientist - Insights San Francisco, Calif..."


In [145]:
# Instantiate CountVectorizer and ignore words that appear in 85% of documents
tv = TfidfVectorizer(max_df=0.85, stop_words='english', use_idf=True)
X = tv.fit_transform(df_res['text'])

In [146]:
X

<3x774 sparse matrix of type '<class 'numpy.float64'>'
	with 857 stored elements in Compressed Sparse Row format>

In [147]:
tv.get_feature_names()

['02',
 '02893b1abca5',
 '05',
 '09',
 '114911611',
 '13',
 '14',
 '15',
 '1508857700820',
 '1663',
 '1998',
 '2001',
 '2002',
 '2003',
 '2004',
 '2006',
 '2007',
 '2009',
 '2011',
 '2012',
 '2013',
 '2014',
 '2015',
 '2017',
 '2018',
 '2019',
 '29',
 '40',
 '419',
 '42424051',
 '444c',
 '5350',
 '591',
 '619',
 '6227',
 '66',
 '78ab6997',
 '850',
 '94080',
 '96',
 '98',
 'aa',
 'ability',
 'able',
 'abroad',
 'academic',
 'accessibility',
 'achievements',
 'acls',
 'act',
 'actionable',
 'activated',
 'activities',
 'acuity',
 'ad',
 'additional',
 'address',
 'addresscountry',
 'addresslocality',
 'addressregion',
 'adept',
 'adult',
 'advance',
 'advancing',
 'adverse',
 'affective',
 'aging',
 'ahead',
 'algorithm',
 'algorithms',
 'allocation',
 'allowlinker',
 'amazonaws',
 'american',
 'analyses',
 'analysis',
 'analytics',
 'analyze',
 'analyzed',
 'angeles',
 'ankara',
 'anonymizeip',
 'answering',
 'applicant',
 'applications',
 'applied',
 'apply',
 'applying',
 'apr',
 'are

In [148]:
tv.vocabulary_

{'02': 0,
 '02893b1abca5': 1,
 '05': 2,
 '09': 3,
 '114911611': 4,
 '13': 5,
 '14': 6,
 '15': 7,
 '1508857700820': 8,
 '1663': 9,
 '1998': 10,
 '2001': 11,
 '2002': 12,
 '2003': 13,
 '2004': 14,
 '2006': 15,
 '2007': 16,
 '2009': 17,
 '2011': 18,
 '2012': 19,
 '2013': 20,
 '2014': 21,
 '2015': 22,
 '2017': 23,
 '2018': 24,
 '2019': 25,
 '29': 26,
 '40': 27,
 '419': 28,
 '42424051': 29,
 '444c': 30,
 '5350': 31,
 '591': 32,
 '619': 33,
 '6227': 34,
 '66': 35,
 '78ab6997': 36,
 '850': 37,
 '94080': 38,
 '96': 39,
 '98': 40,
 'aa': 41,
 'ability': 42,
 'able': 43,
 'abroad': 44,
 'academic': 45,
 'accessibility': 46,
 'achievements': 47,
 'acls': 48,
 'act': 49,
 'actionable': 50,
 'activated': 51,
 'activities': 52,
 'acuity': 53,
 'ad': 54,
 'additional': 55,
 'address': 56,
 'addresscountry': 57,
 'addresslocality': 58,
 'addressregion': 59,
 'adept': 60,
 'adult': 61,
 'advance': 62,
 'advancing': 63,
 'adverse': 64,
 'affective': 65,
 'aging': 66,
 'ahead': 67,
 'algorithm': 68,
 'al

In [156]:
pd.DataFrame(X.tocoo().data)

Unnamed: 0,0
0,0.040827
1,0.093151
2,0.081655
3,0.040827
4,0.040827
5,0.040827
6,0.040827
7,0.031050
8,0.031050
9,0.040827


**These tools give single words by default. I think knowing phrases/sentiments would be more useful for me.**

# Basic NLP with scikit-learn

- Kevin's philosophy is to be really good at scikit-learn and do natural language processing there, rather than being OK at sklearn and OK at NLTK and stitching them together (although I may have to do this)
- These are simple techniques and he thinks you can go far with it

- NLP means building probabilistic models using data about a language. It might be something like identifying the subject, especially if it's a formal pronoun like a named entity (e.g. probability that a capitalized word is a subject).
- You need to understand each language you'll work with and how the world uses it (understanding idioms, sarcasm, etc.)

**Note: Natural language processing is not equivalent to machine learning with text. Machine learning with text is one tool in the arsenal of NLP.**


Higher level "task areas"
- Information retrieval (Google search)
- Information from extraction (events from email)
- Machine translation (Google translate)
- Text simplification (simple Wikipedia)
- Predictive text (texting)
- Sentiment analysis (Hater News)
- Automatic summarization (generating an abstract)
- Natural language generation (sports summary)
- Speech recognition and generation (speech-to-text)
- Question answering (supercomputer Watson on Jeopardy)

The above can be broken down into lower level "components"

- Tokenization
- Stop word removal
- TF-IDF (computing word importance)
- Stemming and lemmatization (running > run)
- Part-of-speech tagging
- Named entity recognition
- Segmentation
- Word sense disambiguation ("buy a mouse")
- Spelling correction
- Language detection
- Machine learningm

# NLTK

following [this](https://www.youtube.com/watch?v=FLZvOKSCkxY)


Terms:
- corpora - body of text (medical journals, presidential speeches)
- lexicon - words and their meanings


## Tokenizing

In [8]:
nltk.download()

NameError: name 'nltk' is not defined

In [None]:
# note downloaded to /users/lacar/nltk_data

In [13]:
from nltk.tokenize import sent_tokenize, word_tokenize

# This can save lots of time versus doing it by regular expressions

ModuleNotFoundError: No module named 'nltk'

In [41]:
# Wanted to see if and would be listed once or it would list each example
example_text = 'Hello Dr. Ben, how are you? The weather is windy and Python is great! I like being alive. and and'

In [39]:
sent_tokenize(example_text)

['Hello Dr. Ben, how are you?',
 'The weather is windy and Python is great!',
 'I like being alive.',
 'and and']

In [40]:
word_tokenize(example_text)

['Hello',
 'Dr.',
 'Ben',
 ',',
 'how',
 'are',
 'you',
 '?',
 'The',
 'weather',
 'is',
 'windy',
 'and',
 'Python',
 'is',
 'great',
 '!',
 'I',
 'like',
 'being',
 'alive',
 '.',
 'and',
 'and']

### Using resume as example for tokenizing

In [44]:
sent_tokenize(resume_text_frdocx)[0:5]

['BENJAMIN LACAR, Ph.D. 619.419.6227  ben.lacar@gmail.com  www.linkedin.com/in/lacar/ github.com/benslack19   Professional highlights  4+ years as an applications and bioinformatics scientist that influences product development and business messaging through data analysis and visualizations.',
 'Skilled in applying Python and R for analysis of high-throughput DNA sequencing data.',
 'Adept at communicating scientific and statistical concepts in professional and educational settings with storytelling.',
 'Extensive academic background (14 publications) in neuroscience.',
 'Skills  R Python MATLAB Statistics Machine Learning Bioinformatics Data Visualization SQL Online Teaching   Professional Experience  Fluidigm Corporation | South San Francisco, CA Bioinformatics Scientist | Genomics R&D                                                                      Jan 2019 - present  Built pipeline for RNA-seq library preparation method, reported metrics, and created ad hoc data visualizations 

## Stop words

In [46]:
# Not super helpful - just a way to import stop words and then you can filter them out using a list comprehension
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

In [50]:
# not sure why it has to be a set
stop_words = set(stopwords.words('english'))

In [51]:
stop_words

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 'r

In [49]:
# Note this is much more than video example, because it's been updated
len(stopwords.words('english'))

179

## Stemming

## Parts of speech tagging

Others:
https://www.youtube.com/watch?v=w36-U-ccajM&list=PLQVvvaa0QuDf2JswnfiGkliBInZnIC4HL&index=2
    
Chunking, chinking, named entity recognition, lemmatizing, NLTK Corpora, wordnet, text classification, words as features for learning, 

# TextBlob

https://textblob.readthedocs.io/en/dev/

# Word2vec

apple + purple = plum

# Recommendations to fill skills gap

In [None]:
query_subject = 'SQL'
url = 'https://www.coursera.org/search?query=' + query_subject
# socket = urllib.request.urlopen(url)
document = urllib.request.urlopen(url).read().decode()

# Bag of words