In [1]:
import re
from collections import Counter

from bs4 import BeautifulSoup
import pandas as pd
import requests

# Text Data Assignment

# 1. String Basics

## Count Characters

Write a function to count the character frequency of any string, and return the counts as a dictionary.

"google.com" -> {'o': 3, 'g': 2, '.': 1, 'e': 1, 'l': 1, 'm': 1, 'c': 1}

In [2]:
def count_letters(string):
    d = {}
    for letter in set(string):
        d[letter] = string.count(letter)

    return d

count_letters('google.com')

{'.': 1, 'c': 1, 'e': 1, 'g': 2, 'l': 1, 'm': 1, 'o': 3}
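
The loop above calls `string.count` once per distinct character, so it rescans the string each time. As a possible alternative, the `Counter` class already imported in the first cell can tally every character in a single pass. The function name `count_letters_counter` is made up for this sketch and is not part of the assignment.

```python
from collections import Counter


def count_letters_counter(string):
    """Count character frequencies in one pass over the string."""
    # Counter is a dict subclass; dict() converts it to the plain
    # dictionary the exercise asks for.
    return dict(Counter(string))


print(count_letters_counter("google.com"))
```

Both versions should return the same mapping; only the number of passes over the input differs.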

## Replace Characters

Write a function that takes two parameters:
1. a string of text
2. an individual character

The function should return a new string in which every occurrence of the specified character has been replaced with a dollar sign: `$`

In [3]:
def redact(document, char):
    # Replace every occurrence of `char` with a dollar sign.
    return document.replace(char, '$')

redact('google', 'o')

'g$$gle'

## Slice Suffixes

Write a function that removes common suffixes from the ends of words. At a minimum, this function should remove the suffixes: "ed", "ing", "es", "tion", "ly"

"The daring fox leaped through the air gracefully, his eyes fixated on the capitulation of his prey." -> 

"The dar fox leap through the air graceful, his ey fixat on the capitula of his prey."

In [4]:
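def bad_suffix_cleaner(document, suffixes):
    # Naive approach: str.replace removes each suffix wherever it occurs,
    # not only at the ends of words.
    for suffix in suffixes:
        document = document.replace(suffix, '')
    return document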

suffixes = ["ed", "ing", "es", "tion", "ly"]
document = "The daring fox leaped through the air gracefully, his eyes fixated on the capitulation of his prey."
bad_suffix_cleaner(document, suffixes)

'The dar fox leap through the air graceful, his ey fixat on the capitula of his prey.'
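
`bad_suffix_cleaner` gives the expected result for this sentence, but `str.replace` would also strip "ed" from the start of "education" or the first "ing" in "singing". One possible refinement, sketched here under the assumption that a suffix counts only when it sits at the very end of a word, is to anchor the match with regex word boundaries. The name `suffix_cleaner` is hypothetical.

```python
import re


def suffix_cleaner(document, suffixes):
    """Remove each suffix only where it ends a word."""
    # Escape the suffixes and try longer ones first.
    alternation = "|".join(
        sorted(map(re.escape, suffixes), key=len, reverse=True)
    )
    # \B: the suffix must follow another word character (not start a word).
    # \b: the word must end immediately after the suffix.
    pattern = re.compile(r"\B(?:" + alternation + r")\b")
    return pattern.sub("", document)


suffixes = ["ed", "ing", "es", "tion", "ly"]
print(suffix_cleaner("singing education", suffixes))  # sing educa
```

This is still not a real stemmer: short words such as "red" would lose their "ed", so a minimum stem length may be worth adding.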

## Remove Stopwords

Write a function that removes common "stopwords" from text. 

In [5]:
stopwords = [
    'i','me','my','myself','we','our','ours','ourselves',
    'you','your','yours','yourself','yourselves','he','him','his','himself',
    'she','her','hers','herself','it','its','itself','they','them','their',
    'theirs','themselves','what','which','who','whom','this','that','these',
    'those','am','is','are','was','were','be','been','being','have','has',
    'had','having','do','does','did','doing','a','an','the','and','but',
    'if','or','because','as','until','while','of','at','by','for','with',
    'about','against','between','into','through','during','before','after',
    'above','below','to','from','up','down','in','out','on','off','over',
    'under','again','further','then','once','here','there','when','where',
    'why','how','all','any','both','each','few','more','most','other','some',
    'such','no','nor','not','only','own','same','so','than','too','very',
    's','t','can','will','just','don','should','now'
]

In [6]:
def stopwords_cleaner(document, stopwords):
    words = re.findall('[a-zA-Z]+', document.lower())
    words = [word for word in words if word not in stopwords]
    return words

stopwords_cleaner(document, stopwords)

['daring',
 'fox',
 'leaped',
 'air',
 'gracefully',
 'eyes',
 'fixated',
 'capitulation',
 'prey']

## 1.5 Vectorize Words

Below is a list of three strings. Each string is a job listing with the job title of "data scientist" from indeed.com. Write a function that does two things:

1) Removes stopwords from each listing (using the function above)

2) Creates a dataframe in which each column header is a particular word and each cell is a 1 or 0 denoting whether that word is present in the job listing. The final dataframe should have only 3 rows, one for each of the three job listings.

Your final dataframe should not include any of the stopwords.

In [7]:
job_listings = [
    'Part-time, Contract, Internship\nSr. Machine Learning/Data Scientist\n\ndata245, Bannockburn, IL seeks data scientists.\n\nWe are open to all levels of experience (down to an intern) as we are building a team around new initiatives.\n\nYou will be developing state of the art algorithms to power various aspects of highly complex business models\nYou can articulate and understand a business problem, identify challenges, formulate the machine learning problem or NLP problems and provide/prototype solutions\nYou will provide technical leadership, identify and understand key business challenges and opportunities, and develop end-to-end software solutions using machine learning/NLP and optimization methods.\n\nYou will collaborate extensively with internal and external partners, program management, and, at a senior level, the engineering team to ensure that solution meet business needs, permit valid inferences and have functional feasibility\nYou will collect and manipulate large volumes of data; build new and improved techniques and/or solutions for data collection, management, and usage\nYou will communicate results in a comprehensible manner to all levels of the company (field teams up to Snr. 
Management) - this will require client facing in the future - but not initially\nYou will brainstorm with other team members and leadership - who has 30 plus years experience in the industry that requires the solution.\nRequirements:\n\nPHD or MS in Statistics, Machine Learning, or Computer Science (or technical degree with commensurate industry experience)\nIdeally the Senior position will possess at least 3 years of relevant work - or academic academic experience, as a Data Scientist / Machine Learning professional.\nExpertise in NLP a bonus\nStrong algorithmic design skills\nOther positions require less tenure, but the same relevant ML understanding.\nPrevious hands on experience, or thesis dedicated to the same\nDeep understanding of classic machine learning and deep learning theory, and extensive hands-on experience putting it into practice\nExcellent understanding of machine learning algorithms, processes, tools and platforms including CNN, RNN, NLP, tensorflow, keras, etc.\nPython proficiency is must\nApplied experience with machine learning on large datasets\\sparse data with structured and unstructured data.\nExperience with deep learning, and their optimizations for efficient implementation.\nGreat communication skills, ability to explain predictive analytics to non-technical audience (not client facing yet, and no sales)\nExperienced in predictive modelling.\nExecute analytical experiments methodically while outputting reproducible research.\nExcited to change an industry struggling to control costs.\nGood to have – Familiar with one or more programming languages e.g. C++ / Java / Android / iOS".\nJob Types: Full-time, Part-time, Internship, Contract\n\nSalary: $75,000.00 to $125,000.00 /year\n\nEducation:\n\nMaster\'s (Preferred)\nWork authorization:\n\nUnited States (Preferred)\nHours per week:\n\n30-39\nOvertime often available:\n\nNo\nContract Length:\n\nMore than 1 year\nTypical end time:\n\n5PM', 
    "$96,970 - $148,967 a year\nThe professionals at the National Security Agency (NSA) have one common goal: to protect our nation. The mission requires a strong offense and a steadfast defense. The offense collects, processes, and disseminates intelligence information derived from foreign signals for intelligence and counterintelligence purposes. The defense prevents adversaries from gaining access to sensitive classified national security information. NSA is the nation's leader in providing foreign signals intelligence while also protecting U.S. government information systems, forging the frontier of communications, and data analysis. We serve the American people by applying technical skills to meaningful work, keeping our friends and families safe for generations to come. You will make a lasting impact serving your country as a Data Scientist at the National Security Agency, using your curiosity to analyze large data sets to inform decision-making against foreign threats. We are looking for critical thinkers, problem solvers, and motivated individuals who are enthusiastic about data and believe that answers to hard questions lie in the yet-to-be-told story of diverse, complicated data sets. You will employ your mathematical science, computer science, and quantitative analysis skills to ensure solutions to complex data problems and take full advantage of the NSA's software and hardware capabilities in all areas of our enterprise, including analytic capabilities, research, and foreign intelligence operations. Data Scientists are hired into positions directly supporting a technical mission office or the Data Scientist Development Program (DSDP). The NSA/CSS Data Scientist Development Program is a three-year opportunity to build your data science talent, experience the breadth of data science at NSA through six- to nine-month assignments in a variety of diverse organizations, and collaborate with NSA's experts in the field of data science. 
You will have opportunities to attend technical conferences with experts from industry and academia. You will routinely discuss and share NSA's challenges and successes at weekly technical roundtables. We foster an environment where you will develop your data science skills, allowing you to quickly contribute to NSA's mission. As a member of a technical mission office or the DSDP, Data Scientists tackle challenging real-world problems leveraging big data, high-performance computing, machine learning, and a breadth of other methodologies. As a Data Scientist at NSA, responsibilities may include: - Collecting and combining data from multiple sources - Uncovering and exploring anomalous data (including metadata) - Applying the scientific process to data evaluation, performing statistical inference, and data mining - Developing analytic plans, engineer supporting algorithms, and design and implement solutions which execute analytic plans. - Designing and developing tools and techniques for analysis - Analyzing data using mathematical/statistical methods - Evaluating, documenting, and communicating research processes, analyses, and results to customers, peers, and leadership - Creating interpretable visualizations\n\nSkills\n\nThe ideal candidate is someone with a desire for continual learning and strong problem-solving, analytic and interpersonal skills. 
You might be a great fit for our team if any of the following describe you: - Completed a degree program in the fields of mathematics, statistics, computer science, computational sciences, or a passion for rigorous analysis of data - Tenacity, integrity, persistence, and willingness to learn - Ability to solve complex problems - Use critical thinking and reasoning to make analytic determinations - Works effectively in a collaborative environment - Strong communications skills to both technical and non-technical audiences - The desire to serve over 300 million fellow Americans and make a difference in world events\n\nPay, Benefits, & Work Schedule\n\nOn the job training, internal NSA courses, and external training will be made available based on the need and experience of the selectee. Monday - Friday, with basic 8 hr/day requirements between 0800 to 1800 (flexible)\n\nPosition Summary\n\nNSA is in search of Computer Science professionals to solve complex problems, test innovative approaches and research new solutions to storing, manipulating, and presenting information. We are looking for you to apply your computer science expertise to projects that seek to create new standards for the transformation of information. If you want to develop technologies and tools and be a part of cutting edge innovations ' join our team of experts! Help protect national security interests as part of the world's most advanced team of computer science professionals!\n\nMandatory Qualification Reqs\n\nCandidates for the NSA's Data Scientist roles are asked to complete a data science examination evaluating their knowledge of statistics, mathematics, and computer science topics that pertain to data science work. Passing this examination is a requirement in order to be considered for selection into a data scientist position. Salary Range: $69,545 - $86,659 (Entry Level/Developmental) *The qualifications listed are the minimum acceptable to be considered for the position. 
Salary offers are based on candidates' education level and years of experience relevant to the position and also take into account information provided by the hiring manager/organization regarding the work level for the position. Entry is with a Bachelor's degree and no experience. An Associate's degree plus 2 years of relevant experience may be considered for individuals with in-depth experience that is clearly related to the position. Degree must be in Mathematics, Applied Mathematics, Statistics, Applied Statistics, Machine Learning, Data Science, Operations Research, or Computer Science. A degree in a related field (e.g., Computer Information Systems, Engineering), a degree in the physical/hard sciences (e.g. physics, chemistry, biology, astronomy), or other science disciplines (i.e., behavioral, social, and life) may be considered if it includes a concentration of coursework (typically 5 or more courses) in advanced mathematics (typically 200 level or higher; such as calculus, differential equations, discrete mathematics) and/or computer science (e.g., algorithms, programming, data structures, data mining, artificial intelligence). College-level Algebra or other math courses intended to meet a basic college level requirement, or upper level math courses designated as elementary or basic do not count. Note: A broader range of degrees will be considered if accompanied by a Certificate in Data Science from an accredited college/university.\n\nRelevant experience must be in designing/implementing machine learning, data mining, advanced analytical algorithms, programming, data science, advanced statistical analysis, artificial intelligence, and/or software engineering. Experience in more than one area is strongly preferred. Salary Range: $80,445 - $107,140 (Full Performance) *The qualifications listed are the minimum acceptable to be considered for the position. 
Salary offers are based on candidates' education level and years of experience relevant to the position and also take into account information provided by the hiring manager/organization regarding the work level for the position. Entry is with a Bachelor's degree plus 3 years of relevant experience or a Master's degree plus 1 year of relevant experience or a Doctoral degree and no experience. An Associate's degree plus 5 years of relevant experience may be considered for individuals with in-depth experience that is clearly related to the position.\nDegree must be in Mathematics, Applied Mathematics, Statistics, Applied Statistics, Machine Learning, Data Science, Operations Research, or Computer Science. A degree in a related field (e.g., Computer Information Systems, Engineering), a degree in the physical/hard sciences (e.g. physics, chemistry, biology, astronomy), or other science disciplines (i.e., behavioral, social, and life) may be considered if it includes a concentration of coursework (typically 5 or more courses) in advanced mathematics (typically 200 level or higher; such as calculus, differential equations, discrete mathematics) and/or computer science (e.g., algorithms, programming, data structures, data mining, artificial intelligence). College-level Algebra or other math courses intended to meet a basic college level requirement, or upper level math courses designated as elementary or basic do not count. Note: A broader range of degrees will be considered if accompanied by a Certificate in Data Science from an accredited college/university. Relevant experience must be in two or more of the following: designing/implementing machine learning, data mining, advanced analytical algorithms, programming, data science, advanced statistical analysis, artificial intelligence, or software engineering. Salary Range: $96,970 - $148,967 (Senior) *The qualifications listed are the minimum acceptable to be considered for the position. 
Salary offers are based on candidates' education level and years of experience relevant to the position and also take into account information provided by the hiring manager/organization regarding the work level for the position. Entry is with a Bachelor's degree plus 6 years of relevant experience or a Master's degree plus 4 years of relevant experience or a Doctoral degree plus 2 years of relevant experience. An Associate's degree plus 8 years of relevant experience may be considered for individuals with in-depth experience that is clearly related to the position. Degree must be in Mathematics, Applied Mathematics, Statistics, Applied Statistics, Machine Learning, Data Science, Operations Research, or Computer Science. A degree in a related field (e.g., Computer Information Systems, Engineering), a degree in the physical/hard sciences (e.g., physics, chemistry, biology, astronomy), or other science disciplines (i.e., behavioral, social, life) may be considered if it includes a concentration of coursework (typically 5 or more courses) in advanced mathematics (typically 200 level or higher; such as calculus, differential equations, discrete mathematics) and/or computer science (e.g., algorithms, programming, data structures, data mining, artificial intelligence). College-level Algebra or other math courses intended to meet a basic college level requirement, or upper level math courses designated as elementary or basic do not count. Note: A broader range of degrees will be considered if accompanied by a Certificate in Data Science from an accredited college/university. Relevant experience must be in two or more of the following: designing/implementing machine learning, data mining, advanced analytical algorithms, programming, data science, advanced statistical analysis, artificial intelligence, or software engineering. Salary Range: $134,789- $164,200 (Expert) *The qualifications listed are the minimum acceptable to be considered for the position. 
Salary offers are based on candidates' education level and years of experience relevant to the position and also take into account information provided by the hiring manager/organization regarding the work level for the position. Entry is with a Bachelor's degree plus 9 years of relevant experience or a Master's degree plus 7 years of relevant experience or a Doctoral degree plus 5 years of relevant experience. An Associate's degree plus 11 years of relevant experience may be considered for individuals with in-depth experience that is clearly related to the position. Degree must be in Mathematics, Applied Mathematics, Statistics, Applied Statistics, Machine Learning, Data Science, Operations Research, or Computer Science. A degree in a related field (e.g., Computer Information Systems, Engineering), a degree in the physical/hard sciences (e.g., physics, chemistry, biology, astronomy), or other science disciplines (i.e., behavioral, social, life) may be considered if it includes a concentration of coursework (typically 5 or more courses) in advanced mathematics (typically 200 level or higher; such as calculus, differential equations, discrete mathematics) and/or computer science (e.g., algorithms, programming, data structures, data mining, artificial intelligence). College-level Algebra or other math courses intended to meet a basic college level requirement, or upper level math courses designated as elementary or basic do not count. Note: A broader range of degrees will be considered if accompanied by a Certificate in Data Science from an accredited college/university. Relevant experience must be in two or more of the following: designing/implementing machine learning, data mining, advanced analytical algorithms, programming, data science, advanced statistical analysis, artificial intelligence, or software engineering.\n\nHow To Apply - External\n\nTo apply for this position, please click the 'Apply' button located at the top right of this posting. 
After completing the application for the first time, or reviewing previously entered information, and clicking the 'Submit' button, you will receive a confirmation email. Please ensure your spam filters are configured to accept emails from noreply@intelligencecareers.gov. ***PLEASE NOTE: U.S. Citizenship is required for all applicants. Reasonable accommodations provided to applicants with disabilities during the application and hiring process where appropriate. NSA is an equal opportunity employer and abides by applicable employment laws and regulations. All applicants and employees are subject to random drug testing in accordance with Executive Order 12564. Employment is contingent upon successful completion of a security background investigation and polygraph. This position is a Defense Civilian Intelligence Personnel System (DCIPS) position in the Excepted Service under 10 U.S.C. 1601. DoD Components with DCIPS positions apply Veterans' Preference to eligible candidates as defined by Section 2108 of Title 5 USC, in accordance with the procedures provided in DoD Instruction 1400.25, Volume 2005, DCIPS Employment and Placement. If you are a veteran claiming veterans' preference, as defined by Section 2108 of Title 5 U.S.C., you may be asked to submit documents verifying your eligibility. Please note that you may be asked a series of questions depending on the position you apply for. Your responses will be used as part of the screening process of your application and will assist in determining your eligibility for the position. Be sure to elaborate on experiences in your resume. Failure to provide the required information or providing inaccurate information will result in your application not being considered for this position. Only those applicants who meet the qualifications for the position will be contacted to begin employment processing. 
Please Note: Job Posting could close earlier than the closing date due to sufficient number of applicants or position no longer available. We encourage you to apply as soon as possible.\n\nDCIPS Disclaimer\n\nThe National Security Agency (NSA) is part of the DoD Intelligence Community Defense Civilian Intelligence Personnel System (DCIPS). All positions in the NSA are in the Excepted Services under 10 United States Codes (USC) 1601 appointment authority.", 
    '\nMinneapolis, Providence or Framingham\n\nWho is Virgin Pulse?\nVirgin Pulse, founded as part of Sir Richard Branson’s famed Virgin Group, helps organizations build employee health and wellbeing into the DNA of their corporate cultures. As the only company to deliver a powerful, mobile-first digital platform infused with live services, including coaching and biometric screenings, Virgin Pulse’s takes a high-tech-meets-high-touch-approach to engage employees in improving across all aspects of their health and wellbeing, every day – from prevention and building a healthy lifestyle to condition and disease management to condition reversal, all while engaging users daily in building and sustaining healthy habits and behaviors. A global leader in health and wellbeing, Virgin Pulse is committed to helping change lives and businesses around the world for good so that people and organizations can thrive, together. Today, more than 3100 organizations across the globe are using Virgin Pulse’s solutions to improve health, employee wellbeing and engagement, reduce costs and create strong workplace cultures.\n\nWho are our employees?\nAt Virgin Pulse we’re passionate about changing lives for good. We want to make a difference in the world by helping people be healthy so they can perform at their best, every day, at work and home. Our award-winning solutions support leading employers in improving and simplifying the employee health and wellbeing journey and engaging people in all aspects of their health. But our world-class products and programs are nothing without our people – the employees who design, build, promote, sell, test and perfect the latest innovations in workplace health and wellbeing. Our people are our top priority and we invest in their health and happiness. 
At Virgin Pulse, we have so much more than a strong, supportive company culture – have a shared vision for a healthier, happier world.\nWho you are.\nYou are an experienced Data Scientist who is capable of providing support to our organization’s efforts to maintain an innovative leadership position in the employee engagement SaaS industry. The Data Scientist accesses datasets from various sources, conducts analysis, and presents the findings of each analytic and reporting project. The incumbent will be able to interpret the findings and clearly communicate results and recommendations to internal and external customers. Moreover, you are a professional who is self-directed and thrives working in a fast-paced, collaborative environment, in which expectations are high both for the quality and speed of work.\nIn the role of a Data Scientist you will wear many hats and your skills will be crucial in the following:\nWrite SQL, R, Python programs to access, clean, and transform required data prior to analysis and reporting\nConsult to and collaborate with analytics and client reporting team members to ensure appropriate data is analyzed and that results are provided in a format consistent with standard and customized client reporting services\nTroubleshoot and perform data audits to ensure and improve data integrity; investigate and resolve data discrepancies\nPlan and manage data analytic and reporting process to ensure the projects remain on schedule\nConduct ad hoc analysis as required using varied analytical tools and techniques\nSupport Client Success, sales and Marketing staff with direct communication with Virgin Pulse clients and prospects regarding the results of the analysis\nAchieve annual Key Performance Indicator objectives, which can include report volumes and scope, internal and external client satisfaction, introducing new areas of data and analysis, and influencing company product and process decisions\n\nWhat you bring to the team.\nIn order to represent 
the best of what we have to offer, you come to us with a multitude of positive attributes including:\nA bachelor’s degree in statistics, computer science, economics, or related field; Master’s degree is a plus\nA minimum of four years of work experience in a similar position\nExperience with data and analytic programming languages such as SQL, R, Python\nExperience with data visualization tools and techniques preferred\nExperience with producing and delivering results using varied media (i.e., multiple MS office formats, dashboards/visualization tools, and potentially other formats)\nExperience in employee health management/health engagement industry preferred\n\nIn addition, you possess the following additional competencies and characteristics:\nStrong analytical skills, with an emphasis on quantitative analysis, descriptive and inferential statistics\nExpertise in statistical analytical software, or the ability to learn through prior experience tools such as SAS, Stata, R, SPSS or similar statistical software\nStrong consulting, communication and presentation skills\nAdvanced R, SQL and database programming skills, experience with MS SQL Server, RedShift, Postgres, and Cassandra/NOSQL databases\nExperience working with large-scale datasets and multiple projects simultaneously\nCreative energy, self-starter, works equally well independently and collaboratively'
]

In [8]:
def vectorize_words(document, stopwords):
    words = stopwords_cleaner(document, stopwords)
    # Counter maps each remaining word to the number of times it appears.
    word_counts = Counter(words)

    return word_counts

In [9]:
dfs = []
for i, listing in enumerate(job_listings):
    df = pd.DataFrame([vectorize_words(listing, stopwords)])
    df['listing_id'] = i
    df = df.set_index('listing_id')
    dfs.append(df)

df = pd.concat(dfs, sort=True)
df = df.fillna(0).astype(int)

df.head()

| listing_id | abides | ability | able | academia | academic | accept | acceptable | access | accesses | accommodations | ... | without | work | working | workplace | works | world | write | year | years | yet |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 2 | 2 | 1 |
| 1 | 1 | 1 | 0 | 1 | 0 | 1 | 4 | 1 | 0 | 1 | ... | 0 | 7 | 0 | 0 | 1 | 3 | 0 | 3 | 15 | 1 |
| 2 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | ... | 1 | 3 | 2 | 2 | 1 | 4 | 1 | 0 | 1 | 0 |
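
The frame above stores word counts (for example, 15 for "years"), whereas the exercise asks for a 1 or 0 per word. A minimal sketch of a presence/absence version follows, using toy listings and a toy stopword list invented for illustration. The name `presence_matrix` is hypothetical.

```python
import re

import pandas as pd


def presence_matrix(documents, stopwords):
    """Build a 0/1 word-presence dataframe with one row per document."""
    stop = set(stopwords)  # set membership checks are fast
    rows = []
    for doc in documents:
        words = re.findall(r"[a-zA-Z]+", doc.lower())
        # A dict keeps one entry per word, so repeats still map to 1.
        rows.append({word: 1 for word in words if word not in stop})
    # Words missing from a document become NaN, then 0.
    frame = pd.DataFrame(rows).fillna(0).astype(int)
    frame.index.name = "listing_id"
    return frame.sort_index(axis=1)


# Toy inputs, not taken from the job listings above.
toy_listings = ["We love data and Python", "Python is a must", "Data, data, data"]
toy_stopwords = ["we", "and", "is", "a"]
toy_frame = presence_matrix(toy_listings, toy_stopwords)
print(toy_frame)
```

For a frame of counts that already exists, `(df > 0).astype(int)` should give the same kind of indicator matrix.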


# 2. Regex + Pandas Practice

Load the contents of the following text file into your notebook: https://raw.githubusercontent.com/CoreyMSchafer/code_snippets/master/Python-Regular-Expressions/data.txt

## Turn the "unstructured" .txt file into a "structured" dataframe
Once you have read in the file's contents your task is to get this unstructured text data into a dataframe with the following headers:

- First Name
- Last Name
- Email
- Phone Number
- Street Address
- City
- State
- Zipcode

In [17]:
r = requests.get('https://raw.githubusercontent.com/'
                 'CoreyMSchafer/code_snippets/master/'
                 'Python-Regular-Expressions/data.txt')
contents = r.text

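# Records are separated by blank lines; each record has one field per line.
df = pd.DataFrame(contents.split('\n\n'))
df[['Name', 'Phone Number', 'Address', 'Email']] = \
    df[0].str.split('\n', expand=True).iloc[:, 0:4]

# Assumes every name is exactly two space-separated parts.
df[['First Name', 'Last Name']] = df['Name'].str.split(' ', expand=True)

# Address layout: "<street>, <city> <STATE> <zip>".
df[['Street Address', 'City', 'State', 'Zipcode']] = \
    df['Address'].str.extract(r'(.+), ([a-zA-Z ]+) ([A-Z]+) ([0-9]+)', expand=True)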

df = df[['First Name', 'Last Name', 'Email', 'Phone Number',
         'Street Address', 'City', 'State', 'Zipcode']]

df.head()

|   | First Name | Last Name | Email | Phone Number | Street Address | City | State | Zipcode |
|---|---|---|---|---|---|---|---|---|
| 0 | Dave | Martin | davemartin@bogusemail.com | 615-555-7164 | 173 Main St. | Springfield | RI | 55924 |
| 1 | Charles | Harris | charlesharris@bogusemail.com | 800-555-5669 | 969 High St. | Atlantis | VA | 34075 |
| 2 | Eric | Williams | laurawilliams@bogusemail.com | 560-555-5153 | 806 1st St. | Faketown | AK | 86847 |
| 3 | Corey | Jefferson | coreyjefferson@bogusemail.com | 900-555-9340 | 826 Elm St. | Epicburg | NE | 10671 |
| 4 | Jennifer | Martin-White | jenniferwhite@bogusemail.com | 714-555-7405 | 212 Cedar St. | Sunnydale | CT | 74983 |
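
The cell above splits on newlines first and applies a regex only to the address. An alternative worth considering is one compiled pattern with named groups that pulls every field out of a record at once. In this sketch the sample record is reconstructed from the first output row, and the phone and email sub-patterns are assumptions about the file's format, not something verified against every record.

```python
import re

# One record, rebuilt from the first row of the output above.
record = (
    "Dave Martin\n"
    "615-555-7164\n"
    "173 Main St., Springfield RI 55924\n"
    "davemartin@bogusemail.com"
)

# Named groups make each captured field self-describing.
RECORD = re.compile(
    r"(?P<first>\S+) (?P<last>\S+)\n"
    r"(?P<phone>\d{3}[-.]\d{3}[-.]\d{4})\n"  # assumed separators: - or .
    r"(?P<street>.+), (?P<city>[A-Za-z ]+) (?P<state>[A-Z]{2}) (?P<zip>\d{5})\n"
    r"(?P<email>\S+@\S+)"
)

match = RECORD.search(record)
fields = match.groupdict() if match else {}
print(fields)
```

Applying `RECORD.finditer` to the full text and passing the resulting dictionaries to `pd.DataFrame` would be one way to build the whole table, provided every record fits the assumed layout.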


# 3. Web Scraping + Pandas

Scrape the unordered list of Ohio private university presidents' salaries from this article: 
[Ohio Private University Presidents' Salaries](https://www.cleveland.com/metro/2017/12/case_western_reserve_university_president_barbara_snyders_base_salary_and_bonus_pay_tops_among_private_colleges_in_ohio.html)

Get the data from this webpage into a dataframe with the following headers:

- First Name
- Last Name
- School
- Salary

Salary should be stored as an integer, without the "$" sign or commas.


In [12]:
url = ('https://www.cleveland.com/metro/2017/12/'
       'case_western_reserve_university_president'
       '_barbara_snyders_base_salary_and_bonus_'
       'pay_tops_among_private_colleges_in_ohio.html')

r = requests.get(url)

soup = BeautifulSoup(r.text, 'html.parser')
# print(soup.prettify())

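# Each salary entry is a list item carrying this class.
items = soup.find_all(class_="article__unordered-list-item")
df = pd.DataFrame([item.text for item in items])

# Capture the name, the school, and a dollar amount (digits and commas).
df[['Name', 'School', 'Salary']] = \
    df[0].str.extract(r'([a-zA-Z ]+), *([a-zA-Z ]+).+\$([0-9,]+)', expand=True)

df = df.drop(columns=[0])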

df.head()

|   | Name | School | Salary |
|---|---|---|---|
| 0 | Grant Cornwell | College of Wooster | 911651 |
| 1 | Marvin Krislov | Oberlin College | 829913 |
| 2 | Mark Roosevelt | Antioch College | 507672 |
| 3 | Laurie Joyner | Wittenberg University | 463504 |
| 4 | Richard Giese | University of Mount Union | 453800 |
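
The result above has a single `Name` column, while the task asks for `First Name` and `Last Name` plus an integer `Salary`. The sketch below shows one way those last two steps might look. It uses three rows copied from the output and assumes the captured salary strings still contain commas, which the `[0-9,]+` group would allow.

```python
import pandas as pd

# Three rows from the output above; the comma-formatted salaries are
# an assumption about what the regex group captures.
sample = pd.DataFrame({
    "Name": ["Grant Cornwell", "Marvin Krislov", "Mark Roosevelt"],
    "School": ["College of Wooster", "Oberlin College", "Antioch College"],
    "Salary": ["911,651", "829,913", "507,672"],
})

# Split on the first space only, so a multi-word last name stays whole.
sample[["First Name", "Last Name"]] = sample["Name"].str.split(" ", n=1, expand=True)

# Drop "$" and "," characters, then store the salary as an integer.
sample["Salary"] = sample["Salary"].str.replace(r"[$,]", "", regex=True).astype(int)

sample = sample[["First Name", "Last Name", "School", "Salary"]]
print(sample)
```

Splitting on the first space treats everything after it as the last name, which is a simplification for names with middle names or initials.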


# 4. Stretch Goals

* Write a web scraper that can scrape "Data Scientist" job listings from indeed.com.
* Look ahead to some of the topics from later this week:
  - Tokenization
  - Stemming
  - Lemmatization
  - Chunking
  - Part of Speech Tagging
  - Named Entity Recognition
  - Document Classification