# Web Scraping for Indeed.com & Predicting Salaries

In this project, we will practice two major skills: collecting data by scraping a website and then building a binary classifier.

We are going to collect salary information on data science jobs in a variety of markets. Then using the location, title, and summary of the job we will attempt to predict the salary of the job. For job posting sites, this would be extraordinarily useful. While most listings DO NOT come with salary information (as you will see in this exercise), being to able extrapolate or predict the expected salaries from other listings can help guide negotiations.

Normally, we could use regression for this task; however, we will convert this problem into classification and use a random forest classifier, as well as another classifier of your choice; either logistic regression, SVM, or KNN. 

- **Question**: Why would we want this to be a classification problem?
- **Answer**: While more precision may be better, there is a fair amount of natural variance in job salaries - predicting a range be may be useful.

Therefore, the first part of the assignment will be focused on scraping Indeed.com. In the second, we'll focus on using listings with salary information to build a model and predict additional salaries.

### Scraping job listings from Indeed.com

We will be scraping job listings from Indeed.com using BeautifulSoup. Luckily, Indeed.com is a simple text page where we can easily find relevant entries.

First, look at the source of an Indeed.com page: (http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l=New+York&start=10")

Notice, each job listing is underneath a `div` tag with a class name of `result`. We can use BeautifulSoup to extract those. 

#### Set up a request (using requests) to the URL below. Use BeautifulSoup to parse the page and extract all results (HINT: Look for div tags with class name result)
The URL here has many query parameters
- q for the job search
- This is followed by "+20,000" to return results with salaries (or expected salaries >$20,000)
- l for a location
- start for what result number to start on

In [227]:
URL = "http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l=New+York&start=10"

In [228]:
import urllib
import requests
import bs4
from bs4 import BeautifulSoup
import pandas as pd
import re

Let's look at one result more closely. A single result looks like
```JSON
<div class=" row result" data-jk="2480d203f7e97210" data-tn-component="organicJob" id="p_2480d203f7e97210" itemscope="" itemtype="http://schema.org/JobPosting">
<h2 class="jobtitle" id="jl_2480d203f7e97210">
<a class="turnstileLink" data-tn-element="jobTitle" onmousedown="return rclk(this,jobmap[0],1);" rel="nofollow" target="_blank" title="AVP/Quantitative Analyst">AVP/Quantitative Analyst</a>
</h2>
<span class="company" itemprop="hiringOrganization" itemtype="http://schema.org/Organization">
<span itemprop="name">
<a href="/cmp/Alliancebernstein?from=SERP&campaignid=serp-linkcompanyname&fromjk=2480d203f7e97210&jcid=b374f2a780e04789" target="_blank">
    AllianceBernstein</a></span>
</span>
<tr>
<td class="snip">
<nobr>$117,500 - $127,500 a year</nobr>
<div>
<span class="summary" itemprop="description">
C onduct quantitative and statistical research as well as portfolio management for various investment portfolios. Collaborate with Quantitative Analysts and</span>
</div>
</div>
</td>
</tr>
</table>
</div>
```

While this has some more verbose elements removed, we can see that there is some structure to the above:
- The salary is available in a nobr element inside of a td element with class='snip.
- The title of a job is in a link with class set to jobtitle and a data-tn-element="jobTitle.
- The location is set in a span with class='location'.
- The company is set in a span with class='company'.

In [229]:
#html = requests.get(URL)
#b = BeautifulSoup(html.text)
#print b.prettify()

In [230]:
# html = urllib.urlopen(url).read()
# soup = BeautifulSoup(html, 'html.parser', from_encoding="utf-8")

In [231]:
# def get_page(url):
#     html = urllib.urlopen(url).read()
#     return html
#     #html = requests.get(ull)

In [232]:
# html = requests.get(URL)
# soup = BeautifulSoup(html)
# df = pd.DataFrame(columns=["Title","Location","Company","Salary", "Synopsis"])
# for each in soup.findAll('div', {'class':"  row  result" }):
#     try: 
#         title = each.find('hs', {'class':'jobtitle'}).text
#         print title
#     except:
#         title = 'asdfdsafdsafdsa'
#         print title

In [233]:
# soup = get_page(URL)
# soup

In [234]:
#def get_story(content):
   # b = BeautifulSoup(html.text,'html.parser')
   # return b.prettify()

In [235]:
#for each in soup.findAll(class_='jobtitle'):
#    print each.text

# def get_job(contents):
#     soup = BeautifulSoup(contents)
#     title = []
#     for each in soup.findAll(class_='jobtitle'):
#         title.append(each.text.replace('\n', ''))
#     return title

#def get_stories(content):
    #soup = BeautifulSoup(content)
    #titles = []

   # for td in soup.findAll("td", { "class":"title" }):
       # a_element = td.find("a")
        #if a_element:
            #titles.append(a_element.string)

    #return titles    
    #soup.find_all(class_='rest-row-name-text')   
#soupp = soup.find_all(class_='jobtitle')

In [236]:
# get_job(soup)

In [237]:
# def job_find(section):
#     try:
#         soup.find(class_='jobtitle')
#     except:
        

## Write 4 functions to extract these items (one function for each): location, company, job title, and salary.¶
Example
```python
def extract_location_from_result(result):
    return result.find ...
```

##### - Make sure these functions are robust and can handle cases where the data/field may not be available.
>- Remember to check if a field is empty or None for attempting to call methods on it
>- Remember to use try/except if you anticipate errors.

- **Test** the functions on the results above and simple examples

In [238]:
### Rithika suggested I merge all of the find_alls into one large function, so the code below was created with her assistance. 

def parse(url):
    html = requests.get(url)
    soup = BeautifulSoup(html.content, 'html.parser', from_encoding="utf-8")
    df = pd.DataFrame(columns=["Title","Location","Company","Salary", "Synopsis"])
    for each in soup.find_all(class_= "result" ):
        try: 
            title = each.find(class_='jobtitle').text.replace('\n', '')
        except:
            title = 'None'
        try:
            location = each.find('span', {'class':"location" }).text.replace('\n', '')
        except:
            location = 'None'
        try: 
            company = each.find(class_='company').text.replace('\n', '')
        except:
            company = 'None'
        try:
            salary = each.find('span', {'class':'no-wrap'}).text
        except:
            salary = 'None'
        synopsis = each.find('span', {'class':'summary'}).text.replace('\n', '')
        df = df.append({'Title':title, 'Location':location, 'Company':company, 'Salary':salary, 'Synopsis':synopsis}, ignore_index=True)
    return df

In [239]:
parse(URL)

Unnamed: 0,Title,Location,Company,Salary,Synopsis
0,Data Scientist - Big Data & Analytics,"New York, NY 10154",KPMG,,KPMG is currently seeking a Data Scientist - B...
1,Data Scientist,"New York, NY 10154",KPMG,,KPMG is currently seeking a Data Scientist to ...
2,Scientist - Business Process Modeling and Simu...,"New York, NY",AIG,,AIG Science is the hub for decision sciences a...
3,Entry Level – Research Analyst/Editor/Content ...,"New York, NY 10017 (Midtown area)","XG Consultants Group, Inc.",$15 an hour,Job Overview: XG Consultants Group is looking ...
4,Machine Learning Engineer,"Armonk, NY",IBM,,Create tools for data ingestion and processing...
5,Associate Data Scientist,"New York, NY",ITL USA,,Infosys – Analytics – US - Senior Associate - ...
6,Data Scientist,"New York, NY",MetroPlus Health Plan,"$90,000 - $115,000 a year",The Data Scientist will be tasked with leading...
7,Data Scientist,"New York, NY",Blue State Digital,,Use data to build a better world. You will be ...
8,"Database Analyst, Junior","New York, NY",NYU Langone Health,,Collects and enters data into data registries ...
9,Data Scientist,"New York, NY 10012 (Little Italy area)",Meetup,,"3+ years working with data analytics, data war..."


In [240]:
#html = requests.get(URL)
#b = BeautifulSoup(html.text)
#print b.prettify()
#b is result 

In [241]:
#def extract_title(result):
    #return [each.text.replace('\n', '') for each in result.find_all(class_='jobtitle')]

In [242]:
#extract_title(b)

In [243]:
# def get_location(contents):
#     soup = BeautifulSoup(contents)
#     location = []
#     for each in soup.findAll('span', {'class':"location" }):
#         location.append(each.text.replace('\n', ''))
#     return location

In [244]:
# get_location(soup)

In [245]:
#def extract_location(result):
    #return [each.text for each in result.find_all('span', {'class':"location" })]
#return [each.text for each in result.find_all('span', {'class':"location" })]

In [246]:
#def extract_locationn(result):
    #return [each.text for each in result.find_all('span', {'itemprop': 'addressLocality'})]
#    locs = []
#    for each in result.find_all('span', {'itemprop': 'addressLocality'}):
#        if each.text:
#            locs.append(each)
#        else:
#            locs.append('NaN')
#    return locs
    #return result.find_all('span', {'itemprop': 'addressLocality'})

In [247]:
#locs = extract_location(b)
#locs

In [248]:
#extract_locationn(b)

In [249]:
# def get_company(contents):
#     soup = BeautifulSoup(contents)
#     company = []
#     for each in soup.find_all(class_='company'):
#         company.append(each.text.replace('\n', ''))
#     return company

In [250]:
# get_company(soup)

In [251]:
#def extract_company(result):
    #return [each.text.replace('\n', '') for each in result.find_all(class_='company')]
#return [each.text for each in result.find_all(class_='company')]

In [252]:
#coms = extract_company(b)
#coms

In [253]:
# def get_salary(contents):
#     soup = BeautifulSoup(contents)
#     salary = []
#     for each in soup:  
#         try:
#             a = soup.find_all('span', class_='no-wrap')
#             salary.append(a.renderContents)
#         except: 
#             salary.append('None')
#     return salary
       
# def get_salary(contents):
#     soup = BeautifulSoup(contents)
#     salary = []
#     for each in soup.findAll('div', {'class':"  row  result" }):
#         try: 
#             soup.find('span', class_='no-wrap')
#             salary.append(each.text)
#         except: 
#             salary.append('NA')
#     return salary
    
    
    
        #match = re.search('[+-]?[0-9]{1,3}(?:,?[0-9]{3})*(?:\.[0-9]{2})?', each.text)
        #if match:
            #salary.append(match)
        #elif mathc = each.findAll(r'', each.text)
        #else:
            #salary.append(None)
    #return salary

In [254]:
# get_salary(soup)

In [255]:
# def extract_salary(result):
#     salaries = []
#     for each in result.find_all('td', class_='snip'):
#         match = re.find_all(r'[\$][+-]?[0-9]{1,3}(?:,?[0-9]{3})*(?:\.[0-9]{2})', each.text)
#         if match:
#             salaries.append(match)
#         else:
#             salaries.append(None)
#     return salaries
    #return [each.text.replace('\n', '') for each in result.find_all(class_='snip')]

#for booking in html.find_all('div', {'class':'booking'}):
    # match all digits
    #match = re.search(r'\d+', booking.text)

In [256]:
# extract_salary(b)

Now, to scale up our scraping, we need to accumulate more results. We can do this by examining the URL above.
- "http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l=New+York&start=10"

There are two query parameters here we can alter to collect more results, the l=New+York and the start=10. The first controls the location of the results (so we can try a different city). The second controls where in the results to start and gives 10 results (thus, we can keep incrementing by 10 to go further in the list).
##### Complete the following code to collect results from multiple cities and starting points.
- Enter your city below to add it to the search
- Remember to convert your salary to U.S. Dollars to match the other cities if the currency is different

In [258]:
YOUR_CITY = 'Washington%2C+DC'

In [266]:
url_template = "http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l={}&start={}"
max_results_per_city = 1000 # Set this to a high-value (5000) to generate more results. 
# Crawling more results, will also take much longer. First test your code on a small number of results and then expand.

i = 0

results = []
df_more = pd.DataFrame(columns=["Title","Location","Company","Salary", "Synopsis"])

for city in set(['New+York', 'Chicago', 'San+Francisco', 'Austin', 'Seattle', 
    'Los+Angeles', 'Philadelphia', 'Atlanta', 'Dallas', 'Pittsburgh', 
    'Portland', 'Phoenix', 'Denver', 'Houston', 'Miami', YOUR_CITY, 
    'Charlottesville', 'Richmond', 'Baltimore', 'Harrisonburg', 'San+Antonio', 'San+Diego', 'San+Jose'
    'Austin', 'Jacksonville', 'Indianapolis', 'Columbus', 'Fort+Worth', 'Charlotte', 'Detroit', 'El+Paso', 
    'Memphis', 'Boston', 'Nashville', 'Louisville', 'Milwaukee', 'Las+Vegas', 'Albuquerque', 'Tucson', 
    'Fresno', 'Sacramento', 'Long+Beach', 'Mesa', 'Virginia+Beach', 'Norfolk', 'Atlanta', 'Colorado+Springs',
    'Raleigh', 'Omaha', 'Oakland', 'Tulsa', 'Minneapolis', 'Cleveland', 'Wichita', 'Arlington', 'New+Orleans', 
    'Bakersfield', 'Tampa', 'Honolulu', 'Anaheim', 'Aurora', 'Santa+Ana', 'Riverside', 'Corpus+Christi', 'Pittsburgh', 
    'Lexington', 'Anchorage', 'Cincinnati', 'Baton+Rouge', 'Chesapeake', 'Alexandria', 'Fairfax', 'Herndon',
    'Reston', 'Roanoke']):
    for start in range(0, max_results_per_city, 10):
        # Grab the results from the request (as above)
        url = url_template.format(city, start)
        # Append to the full set of results
        html = requests.get(url)
        soup = BeautifulSoup(html.content, 'html.parser', from_encoding="utf-8")
        for each in soup.find_all(class_= "result" ):
            try: 
                title = each.find(class_='jobtitle').text.replace('\n', '')
            except:
                title = None
            try:
                location = each.find('span', {'class':"location" }).text.replace('\n', '')
            except:
                location = None
            try: 
                company = each.find(class_='company').text.replace('\n', '')
            except:
                company = None
            try:
                salary = each.find('span', {'class':'no-wrap'}).text
            except:
                salary = None
            try:
                synopsis = each.find('span', {'class':'summary'}).text.replace('\n', '')
            except:
                synopsis = None
            df_more = df_more.append({'Title':title, 'Location':location, 'Company':company, 'Salary':salary, 'Synopsis':synopsis}, ignore_index=True)
            i += 1
            if i % 1000 == 0:  # Ram helped me build this counter to see how many. You can visibly see Ram's vernacular in the print statements.
                print('You have ' + str(i) + ' results. ' + str(df_more.dropna().drop_duplicates().shape[0]) + " of these aren't rubbish.")    
            

You have 1000 results. 13 of these aren't rubbish.
You have 2000 results. 61 of these aren't rubbish.
You have 3000 results. 66 of these aren't rubbish.
You have 4000 results. 243 of these aren't rubbish.
You have 5000 results. 432 of these aren't rubbish.
You have 6000 results. 882 of these aren't rubbish.
You have 7000 results. 1415 of these aren't rubbish.
You have 8000 results. 1416 of these aren't rubbish.
You have 9000 results. 1862 of these aren't rubbish.
You have 10000 results. 2037 of these aren't rubbish.
You have 11000 results. 2190 of these aren't rubbish.
You have 12000 results. 2257 of these aren't rubbish.
You have 13000 results. 2274 of these aren't rubbish.
You have 14000 results. 2307 of these aren't rubbish.
You have 15000 results. 2307 of these aren't rubbish.
You have 16000 results. 2332 of these aren't rubbish.
You have 17000 results. 2379 of these aren't rubbish.
You have 18000 results. 2488 of these aren't rubbish.
You have 19000 results. 2488 of these aren't r

In [267]:
#df_more.to_csv('Indeed_Project_3_df_more_short.csv', encoding='utf-8')

In [844]:
df_more = pd.read_csv('/Users/aakashtandel/Desktop/Indeed_Project_3_df_more_short.csv')

In [845]:
df_more.shape

(100465, 6)

In [846]:
df_more[df_more.Salary != 'None'].shape

(7695, 6)

In [847]:
df_more.tail()

Unnamed: 0.1,Unnamed: 0,Title,Location,Company,Salary,Synopsis
100460,100460,"Manager/Sr. Manager, Diagnostics & Biomarkers","Bothell, WA 98021",Seattle Genetics,,"In collaboration with Bioinformatics, Data Man..."
100461,100461,"Postdoctoral Research Scientist, Digital Anthr...","Redmond, WA",Oculus VR,,"3+ years experience with MAXQDA, Noldus Observ..."
100462,100462,Product Scientist,"Seattle, WA",Indeed,,Work alongside other data scientists and softw...
100463,100463,Data Scientist - Operations,"Seattle, WA",Redfin,,"As a Data Scientist, focused on our Brokerage ..."
100464,100464,Data Scientist II,"Seattle, WA 98119",Big Fish Games,,Understand and consume data from dimensional m...


In [848]:
df_more.drop('Unnamed: 0', inplace=True, axis=1)

In [849]:
#df.Salary.value_counts(dropna=False)

In [850]:
df_more.head()

Unnamed: 0,Title,Location,Company,Salary,Synopsis
0,Environmental Scientist,"Blacksburg, VA",CyberCoders,,Environmental Scientist If you are an Environm...
1,Systems Administrator,"Forest, VA",Innerspec Technologies,,Manage backup and restore services to ensure t...
2,Senior Manager,United States,Exponent,,"Providing case management, data processing, an..."
3,Software Engineering Specialist,"Roanoke, VA 24019",General Electric,,You will work with a group of energized and fo...
4,Sr Staff Software Engineer,"Roanoke, VA 24019",General Electric,,"Architects, Data Scientists, Businesses & Prod..."


In [851]:
df_more.drop_duplicates

<bound method DataFrame.drop_duplicates of                                                     Title  \
0                                 Environmental Scientist   
1                                   Systems Administrator   
2                                          Senior Manager   
3                         Software Engineering Specialist   
4                              Sr Staff Software Engineer   
5                                 Staff Software Engineer   
6       Pharmaceutical Sales Specialist -- Hospital Ca...   
7                  Staff Software Engineer - UI Front End   
8       Senior Software Engineer - Python (relo required)   
9                       Russian Linguist/Research Analyst   
10                       Senior Control Engineer, LabVIEW   
11      Occupational Medicine Consultant (Relocation t...   
12              Marketing Integration & Opportunity (MIO)   
13                                  Systems Administrator   
14                                        

In [852]:
df_more.shape

(100465, 5)

#### Use the functions you wrote above to parse out the 4 fields - location, title, company and salary. Create a dataframe from the results with those 4 columns.

In [835]:
# import pandas as pd
# indeed_frame = pd.DataFrame(columns=["title","location","company","salary"])

In [725]:
# title = []
# location = []
# company = []
# salary = []
# for each in results:
#     title.append(extract_title(each))
#     location.append(extract_location(each))
#     company.append(extract_company(each))
#     salary.append(extract_salary(each))
# #dc_eats.loc[len(dc_eats)]=[name, location, compay, bookings]
# indeed_frame = {'title':title, 'location':location, 'company':company, 'salary':salary}

In [726]:
# indeed_frame

In [727]:
# def parser(url):
#     for each in url.find_all('div', {'class':'  row  result'}):
#         titlee = each.find_all(class_='jobtitle')
#         title = titlee.text.replace('\n', '')
#         locationn = result.find_all('span', {'class':"location" })
#         location = locationn.text
#         companyy = result.find_all(class_='company')
#         company = companyy.text.replace('\n', '')
#         match = re.findall(r'[\$][+-]?[0-9]{1,3}(?:,?[0-9]{3})*(?:\.[0-9]{2})', each.text)
#         salary = []
#         if match:
#             salary.append(match)
#         else:
#             salary.append(None)
#         indeed_frame.loc[len(indeed_frame)]=[name, location, company, salary]
#     return indeed_frame.head()

In [681]:
# url = 'https://www.indeed.com/jobs?q=data+scientist+$20,000&l=New+York&start=10' 
# html = requests.get(url)
# soup = BeautifulSoup(html.text, 'html.parser')
# #soup

In [682]:
# loop through each entry
#for entry in html.find_all('div', {'class':'result content-section-list-row cf with-times'}):
    # grab the name
#    name = entry.find('span', {'class': 'rest-row-name-text'}).text
    # grab the location
#    location = entry.find('span', {'class': 'rest-row-meta--location rest-row-meta-text'}).renderContents()
    # grab the price
#    price = entry.find('div', {'class': 'rest-row-pricing'}).find('i').renderContents().count('$')
    # try to find the number of bookings
#    try:
#        temp = entry.find('div', {'class':'booking'}).text
#        match = re.search(r'\d+', temp)
#        if match:
#            bookings = match.group()
#    except:
#        bookings = 'NA'
    # add to df
#    dc_eats.loc[len(dc_eats)]=[name, location, price, bookings]

Lastly, we need to clean up salary data. 

1. Only a small number of the scraped results have salary information - only these will be used for modeling.
1. Some of the salaries are not yearly but hourly or weekly, these will not be useful to us for now
1. Some of the entries may be duplicated
1. The salaries are given as text and usually with ranges.

#### Find the entries with annual salary entries, by filtering the entries without salaries or salaries that are not yearly (filter those that refer to hour or week). Also, remove duplicate entries

In [683]:
#df.shape

In [684]:
#df.drop_duplicates

In [685]:
#df.shape

In [686]:
#df_dropped = df[df.Salary.str.contains("None") == False]

In [853]:
df_more = df_more[df_more.Salary.str.contains("hour") == False]
df_more = df_more[df_more.Salary.str.contains("month") == False]

In [854]:
#df_dropped.shape

In [855]:
df_more[60:120]

Unnamed: 0,Title,Location,Company,Salary,Synopsis
60,Senior Software Engineer - Python (relo required),United States,Spoken Communications,,You'll find yourself collaborating with world-...
61,"Senior Control Engineer, LabVIEW",United States,Tri Alpha Energy,,"Networking using Ethernet, EtherNet/IP, Data D..."
62,Occupational Medicine Consultant (Relocation t...,United States,Saudi Aramco,,Analyze and interpret statistical data for eva...
63,Marketing Integration & Opportunity (MIO),United States,Yahoo,,We strive for industry-leading work in a diver...
64,Systems Administrator,"Forest, VA",Innerspec Technologies,,Manage backup and restore services to ensure t...
65,Senior Manager,United States,Exponent,,"Providing case management, data processing, an..."
66,Software Engineering Specialist,"Roanoke, VA 24019",General Electric,,You will work with a group of energized and fo...
67,Sr Staff Software Engineer,"Roanoke, VA 24019",General Electric,,"Architects, Data Scientists, Businesses & Prod..."
68,Staff Software Engineer,"Roanoke, VA 24019",General Electric,,"Architects, Data Scientists, Businesses & Prod..."
69,Pharmaceutical Sales Specialist -- Hospital Ca...,"Roanoke, VA",AstraZeneca,,Experience working with Medical Information Sc...


In [690]:
#df_dropped  = df_dropped.drop('Unnamed: 0',1)

In [691]:
#df_dropped.head()

In [692]:
#df_dropped = df_dropped.reset_index()
#df_dropped.head()

In [693]:
#df_drop  = df_dropped.drop('index',1)

In [694]:
#df_drop.head()

In [695]:
#df_dropped.drop_duplicates

In [696]:
#df_drop.shape

In [697]:
#df_dropped[50:100]

In [856]:
df_more = df_more[df_more.Salary != 'None'].drop_duplicates().dropna()

In [857]:
df_more.shape

(395, 5)

I have 572 unique jobs with salary data. 

In [700]:
#df_dropped[150:200]

In [None]:
# df_dropped = df_dropped['Salary'].str.strip()
# df_dropped = df_dropped['Title'].str.strip()
# df_dropped = df_dropped['Location'].str.strip()
# df_dropped = df_dropped['Company'].str.strip()
# df_dropped = df_dropped['Synopsis'].str.strip()
# df_dropped.head()

In [510]:
df_drop.Company.dtype

NameError: name 'df_drop' is not defined

In [None]:
# df_no_desc = df_dropped.drop('Synopsis', axis=1)
# df_no_desc[60:70]

In [None]:
# df_no_desc.drop_duplicates()

In [None]:
# df_no_desc

In [None]:
 for each in dataframe:
        each = each.strip('a year')

#### Write a function that takes a salary string and converts it to a number, averaging a salary range if necessary

In [858]:
print (df_more.shape)
df_more = df_more[df_more.Salary.str.contains("hour") == False]
df_more = df_more[df_more.Salary.str.contains("month") == False]
print (df_more.shape)
df_more.head()

(395, 5)
(395, 5)


Unnamed: 0,Title,Location,Company,Salary,Synopsis
1263,Environmental Consultant,"Nashville, TN 37220",LP Environmental,"$50,000 - $100,000 a year",We are seeking a mid-level Environmental Scien...
4001,Research Analyst,"Chicago, IL",Illinois Criminal Justice Information Auth...,"$55,000 - $60,000 a year",Produce analyses of crime trends and provide d...
4176,Data Scientist,"Chicago, IL",Smith Hanley Associates,"$80,000 - $130,000 a year",Requirements of the Data Scientist:. Prominent...
4177,Lead Data Scientist,"Chicago, IL",Workbridge Associates,"$140,000 a year",A major healthcare corporation located right i...
4184,"Lead Data Scientist (Tableau, RShiny, D3)","Chicago, IL",Workbridge Associates,"$130,000 a year",Hiring for a Senior-Lead Data Scientist for a ...


In [859]:
def salary_stripper(dataframe, column):
    dataframe[str(column)] = dataframe[str(column)].replace({'\$':''}, regex = True)
    dataframe[str(column)].replace(regex=True,inplace=True,to_replace=r'\D',value=r' ')
    dataframe[str(column)] = dataframe[str(column)].str.replace(' ',',')
    dataframe = dataframe.join(dataframe[str(column)].str.split(',,,', 1, expand=True).rename(columns={0:'Low', 1:'High'}))
    dataframe['Low'] = dataframe['Low'].str.replace(',','')
    dataframe['Low'] = dataframe['Low'].astype('float64')
    dataframe.drop(str(column), axis=1, inplace=True)
    dataframe['High'] = dataframe['High'].str.replace(',','')
    dataframe['High'] = dataframe['High'].apply(pd.to_numeric)
    dataframe['Average'] = dataframe[['Low', 'High']].mean(axis=1)
    return dataframe.head()

In [860]:
salary_stripper(df_more, 'Salary')

Unnamed: 0,Title,Location,Company,Synopsis,Low,High,Average
1263,Environmental Consultant,"Nashville, TN 37220",LP Environmental,We are seeking a mid-level Environmental Scien...,50000.0,100000.0,75000.0
4001,Research Analyst,"Chicago, IL",Illinois Criminal Justice Information Auth...,Produce analyses of crime trends and provide d...,55000.0,60000.0,57500.0
4176,Data Scientist,"Chicago, IL",Smith Hanley Associates,Requirements of the Data Scientist:. Prominent...,80000.0,130000.0,105000.0
4177,Lead Data Scientist,"Chicago, IL",Workbridge Associates,A major healthcare corporation located right i...,140000.0,,140000.0
4184,"Lead Data Scientist (Tableau, RShiny, D3)","Chicago, IL",Workbridge Associates,Hiring for a Senior-Lead Data Scientist for a ...,130000.0,,130000.0


In [822]:
df_more['Salary'] = df_more['Salary'].replace({'\$':''}, regex = True)
df_more['Salary'].replace(regex=True,inplace=True,to_replace=r'\D',value=r' ')
df_more['Salary'] = df_more['Salary'].str.replace(' ',',')
df_more = df_more.join(df_more['Salary'].str.split(',,,', 1, expand=True).rename(columns={0:'Salary Low', 1:'Salary High'}))
df_more['Salary Low'] = df_more['Salary Low'].str.replace(',','')
df_more['Salary High'] = df_more['Salary High'].str.replace(',','')
df_more['Salary Low'] = df_more['Salary Low'].astype('float64')
df_more.drop('Salary', axis=1, inplace=True)
df_more['Salary High'] = df_more['Salary High'].apply(pd.to_numeric)
df_more['Average'] = df_more[['Salary Low', 'Salary High']].mean(axis=1)
#df_more.drop('Salary High', axis=1, inplace=True)

In [823]:
df_more.head()

Unnamed: 0,Title,Location,Company,Synopsis,Salary Low,Salary High,Average
1263,Environmental Consultant,"Nashville, TN 37220",LP Environmental,We are seeking a mid-level Environmental Scien...,50000.0,100000.0,75000.0
4001,Research Analyst,"Chicago, IL",Illinois Criminal Justice Information Auth...,Produce analyses of crime trends and provide d...,55000.0,60000.0,57500.0
4176,Data Scientist,"Chicago, IL",Smith Hanley Associates,Requirements of the Data Scientist:. Prominent...,80000.0,130000.0,105000.0
4177,Lead Data Scientist,"Chicago, IL",Workbridge Associates,A major healthcare corporation located right i...,140000.0,,140000.0
4184,"Lead Data Scientist (Tableau, RShiny, D3)","Chicago, IL",Workbridge Associates,Hiring for a Senior-Lead Data Scientist for a ...,130000.0,,130000.0


In [794]:
for each in df_more['Salary High']:
    if each == '':
        each = None


In [802]:
df_more['Salary High'] = df_more['Salary High'].apply(pd.to_numeric)

In [803]:
df_more.head()

Unnamed: 0,Title,Location,Company,Synopsis,Salary Low,Salary High
1263,Environmental Consultant,"Nashville, TN 37220",LP Environmental,We are seeking a mid-level Environmental Scien...,50000.0,100000.0
4001,Research Analyst,"Chicago, IL",Illinois Criminal Justice Information Auth...,Produce analyses of crime trends and provide d...,55000.0,60000.0
4176,Data Scientist,"Chicago, IL",Smith Hanley Associates,Requirements of the Data Scientist:. Prominent...,80000.0,130000.0
4177,Lead Data Scientist,"Chicago, IL",Workbridge Associates,A major healthcare corporation located right i...,140000.0,
4184,"Lead Data Scientist (Tableau, RShiny, D3)","Chicago, IL",Workbridge Associates,Hiring for a Senior-Lead Data Scientist for a ...,130000.0,


In [804]:
df_more['Average'] = df_more[['Salary Low', 'Salary High']].mean(axis=1)

In [805]:
df_more.head()

Unnamed: 0,Title,Location,Company,Synopsis,Salary Low,Salary High,avg
1263,Environmental Consultant,"Nashville, TN 37220",LP Environmental,We are seeking a mid-level Environmental Scien...,50000.0,100000.0,75000.0
4001,Research Analyst,"Chicago, IL",Illinois Criminal Justice Information Auth...,Produce analyses of crime trends and provide d...,55000.0,60000.0,57500.0
4176,Data Scientist,"Chicago, IL",Smith Hanley Associates,Requirements of the Data Scientist:. Prominent...,80000.0,130000.0,105000.0
4177,Lead Data Scientist,"Chicago, IL",Workbridge Associates,A major healthcare corporation located right i...,140000.0,,140000.0
4184,"Lead Data Scientist (Tableau, RShiny, D3)","Chicago, IL",Workbridge Associates,Hiring for a Senior-Lead Data Scientist for a ...,130000.0,,130000.0


In [801]:
df_more.fillna()
df_more.head()

ValueError: must specify a fill method or value

In [795]:
df_more.head()

Unnamed: 0,Title,Location,Company,Synopsis,Salary Low,Salary High
1263,Environmental Consultant,"Nashville, TN 37220",LP Environmental,We are seeking a mid-level Environmental Scien...,50000.0,100000.0
4001,Research Analyst,"Chicago, IL",Illinois Criminal Justice Information Auth...,Produce analyses of crime trends and provide d...,55000.0,60000.0
4176,Data Scientist,"Chicago, IL",Smith Hanley Associates,Requirements of the Data Scientist:. Prominent...,80000.0,130000.0
4177,Lead Data Scientist,"Chicago, IL",Workbridge Associates,A major healthcare corporation located right i...,140000.0,
4184,"Lead Data Scientist (Tableau, RShiny, D3)","Chicago, IL",Workbridge Associates,Hiring for a Senior-Lead Data Scientist for a ...,130000.0,


In [758]:
df_more.drop('Salary', axis=1, inplace=True)

In [759]:
df_more.head()

Unnamed: 0,Title,Location,Company,Synopsis,Salary Low,Salary High
1263,Environmental Consultant,"Nashville, TN 37220",LP Environmental,We are seeking a mid-level Environmental Scien...,50000.0,100000.0
4001,Research Analyst,"Chicago, IL",Illinois Criminal Justice Information Auth...,Produce analyses of crime trends and provide d...,55000.0,60000.0
4176,Data Scientist,"Chicago, IL",Smith Hanley Associates,Requirements of the Data Scientist:. Prominent...,80000.0,130000.0
4177,Lead Data Scientist,"Chicago, IL",Workbridge Associates,A major healthcare corporation located right i...,140000.0,
4184,"Lead Data Scientist (Tableau, RShiny, D3)","Chicago, IL",Workbridge Associates,Hiring for a Senior-Lead Data Scientist for a ...,130000.0,


In [703]:
df_more['Salary'].replace(regex=True,inplace=True,to_replace=r'\D',value=r' ')
df_more['Salary'] = df_more['Salary'].str.replace(' ',',')

In [705]:
df_more['Salary'] = df_more['Salary'].str.replace(' ',',')

In [706]:
df_more.head()

Unnamed: 0,Title,Location,Company,Salary,Synopsis
1263,Environmental Consultant,"Nashville, TN 37220",LP Environmental,"50,000,,,100,000,,,,,,,",We are seeking a mid-level Environmental Scien...
4001,Research Analyst,"Chicago, IL",Illinois Criminal Justice Information Auth...,"55,000,,,60,000,,,,,,,",Produce analyses of crime trends and provide d...
4176,Data Scientist,"Chicago, IL",Smith Hanley Associates,"80,000,,,130,000,,,,,,,",Requirements of the Data Scientist:. Prominent...
4177,Lead Data Scientist,"Chicago, IL",Workbridge Associates,"140,000,,,,,,,",A major healthcare corporation located right i...
4184,"Lead Data Scientist (Tableau, RShiny, D3)","Chicago, IL",Workbridge Associates,"130,000,,,,,,,",Hiring for a Senior-Lead Data Scientist for a ...


Unnamed: 0,first,row
0,50,000 100 000
1,55,000 60 000
2,80,000 130 000
3,140,000
4,130,000


In [707]:
#df_more['Salary_low', 'Salary_high'] = df_more['Salary'].str.split(',,,')
df_more = df_more.join(df_more['Salary'].str.split(',,,', 1, expand=True).rename(columns={0:'Salary Low', 1:'Salary High'}))
#df.join(df['AB'].str.split('-', 1, expand=True).rename(columns={0:'A', 1:'B'}))
#df['AB_split'] = df['AB'].str.split('-')
df_more.head()

Unnamed: 0,Title,Location,Company,Salary,Synopsis,Salary Low,Salary High
1263,Environmental Consultant,"Nashville, TN 37220",LP Environmental,"50,000,,,100,000,,,,,,,",We are seeking a mid-level Environmental Scien...,50000,"100,000,,,,,,,"
4001,Research Analyst,"Chicago, IL",Illinois Criminal Justice Information Auth...,"55,000,,,60,000,,,,,,,",Produce analyses of crime trends and provide d...,55000,"60,000,,,,,,,"
4176,Data Scientist,"Chicago, IL",Smith Hanley Associates,"80,000,,,130,000,,,,,,,",Requirements of the Data Scientist:. Prominent...,80000,"130,000,,,,,,,"
4177,Lead Data Scientist,"Chicago, IL",Workbridge Associates,"140,000,,,,,,,",A major healthcare corporation located right i...,140000,",,,,"
4184,"Lead Data Scientist (Tableau, RShiny, D3)","Chicago, IL",Workbridge Associates,"130,000,,,,,,,",Hiring for a Senior-Lead Data Scientist for a ...,130000,",,,,"


In [710]:
#df_more['Salary Low'] = df_more['Salary Low'].str.replace(',','')
df_more['Salary High'] = df_more['Salary High'].str.replace(',','')

In [711]:
df_more.head()

Unnamed: 0,Title,Location,Company,Salary,Synopsis,Salary Low,Salary High
1263,Environmental Consultant,"Nashville, TN 37220",LP Environmental,"50,000,,,100,000,,,,,,,",We are seeking a mid-level Environmental Scien...,50000,100000.0
4001,Research Analyst,"Chicago, IL",Illinois Criminal Justice Information Auth...,"55,000,,,60,000,,,,,,,",Produce analyses of crime trends and provide d...,55000,60000.0
4176,Data Scientist,"Chicago, IL",Smith Hanley Associates,"80,000,,,130,000,,,,,,,",Requirements of the Data Scientist:. Prominent...,80000,130000.0
4177,Lead Data Scientist,"Chicago, IL",Workbridge Associates,"140,000,,,,,,,",A major healthcare corporation located right i...,140000,
4184,"Lead Data Scientist (Tableau, RShiny, D3)","Chicago, IL",Workbridge Associates,"130,000,,,,,,,",Hiring for a Senior-Lead Data Scientist for a ...,130000,


In [712]:
df_more['Salary Low'] = df_more['Salary Low'].astype('float64') 

In [713]:
df_more['Salary High'] = df_more['Salary High'].astype('float64') 

ValueError: could not convert string to float: 

In [714]:
df_more.head()

Unnamed: 0,Title,Location,Company,Salary,Synopsis,Salary Low,Salary High
1263,Environmental Consultant,"Nashville, TN 37220",LP Environmental,"50,000,,,100,000,,,,,,,",We are seeking a mid-level Environmental Scien...,50000.0,100000.0
4001,Research Analyst,"Chicago, IL",Illinois Criminal Justice Information Auth...,"55,000,,,60,000,,,,,,,",Produce analyses of crime trends and provide d...,55000.0,60000.0
4176,Data Scientist,"Chicago, IL",Smith Hanley Associates,"80,000,,,130,000,,,,,,,",Requirements of the Data Scientist:. Prominent...,80000.0,130000.0
4177,Lead Data Scientist,"Chicago, IL",Workbridge Associates,"140,000,,,,,,,",A major healthcare corporation located right i...,140000.0,
4184,"Lead Data Scientist (Tableau, RShiny, D3)","Chicago, IL",Workbridge Associates,"130,000,,,,,,,",Hiring for a Senior-Lead Data Scientist for a ...,130000.0,


In [598]:
df_more.Salary = [x.strip().replace(' ', ',') for x in df_more.Salary]
#list2 = [x for ind, x in enumerate(list1) if 4 > ind > 0]
#formatted.columns = [x.strip().replace(' ', '_') for x in formatted.columns]

In [612]:
df_more['Salary Low']  = [x.strip().replace(',', '') for x in df_more['Salary Low']]

KeyError: 'Salary Low'

In [861]:
#salary_stripper(df_more, 'Salary')
df_more.head()

Unnamed: 0,Title,Location,Company,Salary,Synopsis
1263,Environmental Consultant,"Nashville, TN 37220",LP Environmental,"50,000,,,100,000,,,,,,,",We are seeking a mid-level Environmental Scien...
4001,Research Analyst,"Chicago, IL",Illinois Criminal Justice Information Auth...,"55,000,,,60,000,,,,,,,",Produce analyses of crime trends and provide d...
4176,Data Scientist,"Chicago, IL",Smith Hanley Associates,"80,000,,,130,000,,,,,,,",Requirements of the Data Scientist:. Prominent...
4177,Lead Data Scientist,"Chicago, IL",Workbridge Associates,"140,000,,,,,,,",A major healthcare corporation located right i...
4184,"Lead Data Scientist (Tableau, RShiny, D3)","Chicago, IL",Workbridge Associates,"130,000,,,,,,,",Hiring for a Senior-Lead Data Scientist for a ...


In [871]:
df_more = pd.read_csv('/Users/aakashtandel/Desktop/Indeed_Project_3_df_more.csv', index_col=0)

In [872]:
df_more.head()

Unnamed: 0,Title,Location,Company,Salary,Synopsis
0,Systems Administrator,"Forest, VA",Innerspec Technologies,,Manage backup and restore services to ensure t...
1,Environmental Scientist,"Blacksburg, VA",CyberCoders,,Environmental Scientist If you are an Environm...
2,Senior Manager,United States,Exponent,,"Providing case management, data processing, an..."
3,Software Engineering Specialist,"Roanoke, VA 24019",General Electric,,You will work with a group of energized and fo...
4,Staff Software Engineer,"Roanoke, VA 24019",General Electric,,"Architects, Data Scientists, Businesses & Prod..."


In [873]:
df_more.shape[0] == df_more[df_more.Salary.str.contains('year')]

Unnamed: 0,Title,Location,Company,Salary,Synopsis
4769,False,False,False,False,False
13843,False,False,False,False,False
13887,False,False,False,False,False
13891,False,False,False,False,False
13938,False,False,False,False,False
14012,False,False,False,False,False
14037,False,False,False,False,False
14042,False,False,False,False,False
14075,False,False,False,False,False
14081,False,False,False,False,False


### Save your results as a CSV

In [None]:
# Export to csv


## Predicting salaries using Random Forests + Another Classifier

#### Load in the the data of scraped salaries

In [None]:
## YOUR CODE HERE

#### We want to predict a binary variable - whether the salary was low or high. Compute the median salary and create a new binary variable that is true when the salary is high (above the median)

We could also perform Linear Regression (or any regression) to predict the salary value here. Instead, we are going to convert this into a _binary_ classification problem, by predicting two classes, HIGH vs LOW salary.

While performing regression may be better, performing classification may help remove some of the noise of the extreme salaries. We don't _have_ to choose the `median` as the splitting point - we could also split on the 75th percentile or any other reasonable breaking point.

In fact, the ideal scenario may be to predict many levels of salaries, 

In [None]:
## YOUR CODE HERE

#### Thought experiment: What is the baseline accuracy for this model?

In [None]:
## YOUR CODE HERE

#### Create a Random Forest model to predict High/Low salary using Sklearn. Start by ONLY using the location as a feature. 

In [None]:
## YOUR CODE HERE

#### Create a few new variables in your dataframe to represent interesting features of a job title.
- For example, create a feature that represents whether 'Senior' is in the title or whether 'Manager' is in the title. 
- Then build a new Random Forest with these features. Do they add any value?
- After creating these variables, use count-vectorizer to create features based on the words in the job titles.
- Build a new random forest model with location and these new features included.

In [None]:
## YOUR CODE HERE

#### Use cross-validation in scikit-learn to evaluate the model above. 
- Evaluate the accuracy of the model, as well as any other metrics you feel are appropriate. 

In [None]:
## YOUR CODE HERE

#### Repeat the model-building process with a non-tree-based method.

In [None]:
## YOUR CODE HERE

### BONUS 

#### Bonus: Use Count Vectorizer from scikit-learn to create features from the job descriptions. 
- Examine using count or binary features in the model
- Re-evaluate your models using these. Does this improve the model performance? 
- What text features are the most valuable? 

In [None]:
## YOUR CODE HERE