# Project 4: Web Scraping Indeed.com & Predicting Salaries

In Project 4, we practice two major skills: collecting data via web scraping and building a binary predictor with Logistic Regression.

We will collect salary information on data science jobs in a variety of markets. Using location, title, and job summary, we'll predict the salary of the job. For job posting sites, this is extraordinarily useful. While most listings DO NOT come with salary information (as you will see in this exercise), extrapolating expected salary can help guide negotiations.

Normally, we can use regression for this task; however, we will convert this problem into classification and use Logistic Regression.

- Q: Why would we want this to be a classification problem?
- A: While more precision may be better, there is a fair amount of natural variance in job salaries, so predicting a range may be useful.
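
To make the classification framing concrete, here is a minimal sketch of turning numeric salaries into a binary target. The cutoff (the median of the observed salaries), the helper names `median` and `to_binary_target`, and the sample values are all assumptions for illustration; the project does not prescribe a particular threshold.

```python
def median(values):
    """Return the median of a non-empty list of numbers."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2.0


def to_binary_target(salaries):
    """Label each salary 1 if it is above the sample median, else 0."""
    threshold = median(salaries)
    return [1 if s > threshold else 0 for s in salaries]


salaries = [50000.0, 90000.0, 120000.0, 137500.0]  # made-up values
labels = to_binary_target(salaries)
print(labels)
```

A logistic regression would then be fit against `labels` instead of the raw salary figures.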

Section one focuses on scraping Indeed.com; then we use listings with salary information to build a model and predict additional salaries.

### Scraping job listings from Indeed.com

Scrape job listings from Indeed.com using BeautifulSoup. Luckily, Indeed.com is a simple text page where we can easily find relevant entries. First, look at the source of an Indeed.com page: (http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l=New+York&start=10)

Notice that each job listing sits underneath a `div` tag with a class name of `result`. We can use BeautifulSoup to extract them.

#### Set up a request (using `requests`) to the URL below. Use BeautifulSoup to parse the page and extract all results (HINT: Look for div tags with class name result)

The URL here has several query parameters:

- `q` for the job search
- This is followed by "+%2420%2C000" (the URL-encoded form of "$20,000") to return results with salaries (or expected salaries) above $20,000
- `l` for a location
- `start` for what result number to start on
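
As a small illustration of how these parameters end up in the URL, the sketch below builds the query string with the standard library's `urlencode`. The variable names `params` and `search_url` are assumptions for illustration. Note that the `$` and `,` in "$20,000" are percent-encoded as `%24` and `%2C`, and spaces become `+`.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode        # Python 2

# A list of pairs keeps the parameter order stable.
params = [('q', 'data scientist $20,000'), ('l', 'New York'), ('start', 10)]
search_url = 'http://www.indeed.com/jobs?' + urlencode(params)
print(search_url)
```

Building the query this way avoids hand-encoding special characters when the search term or city changes.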

In [49]:
URL = 'http://www.indeed.com/q-data-scientist-l-Atlanta,-GA-jobs.html'

In [50]:
import requests
import bs4
from bs4 import BeautifulSoup

In [51]:
# fetch the page and parse it with BeautifulSoup
r = requests.get(URL)
soup = BeautifulSoup(r.content, "lxml")

# collect every div whose class includes "result"
results = soup.findAll('div', { "class" : "result" })

Let's look at one result more closely. A single `result` looks like

```
<div class=" row result" data-jk="2480d203f7e97210" data-tn-component="organicJob" id="p_2480d203f7e97210" itemscope="" itemtype="http://schema.org/JobPosting">
<h2 class="jobtitle" id="jl_2480d203f7e97210">
<a class="turnstileLink" data-tn-element="jobTitle" onmousedown="return rclk(this,jobmap[0],1);" rel="nofollow" target="_blank" title="AVP/Quantitative Analyst">AVP/Quantitative Analyst</a>
</h2>
<span class="company" itemprop="hiringOrganization" itemtype="http://schema.org/Organization">
<span itemprop="name">
<a href="/cmp/Alliancebernstein?from=SERP&amp;campaignid=serp-linkcompanyname&amp;fromjk=2480d203f7e97210&amp;jcid=b374f2a780e04789" target="_blank">
    AllianceBernstein</a></span>
</span>
<tr>
<td class="snip">
<nobr>$117,500 - $127,500 a year</nobr>
<div>
<span class="summary" itemprop="description">
Conduct quantitative and statistical research as well as portfolio management for various investment portfolios. Collaborate with Quantitative Analysts and</span>
</div>
</div>
</td>
</tr>
</table>
</div>
```

While this has some of the more verbose elements removed, we can see that there is some structure to the above:
- The salary is available in a `nobr` element inside of a `td` element with `class='snip'`.
- The title of a job is in a link (`a`) with `data-tn-element="jobTitle"`, nested inside an `h2` with class `jobtitle`.
- The location is set in a `span` with `class='location'`.
- The company is set in a `span` with `class='company'`.

### Write 4 functions, one to extract each item: location, company, job title, and salary.

example: 
```python
def extract_location_from_result(result):
    return result.find ...
```

- Make sure these functions are robust and can handle cases where the data/field may not be available
- Test the functions on the results above

In [52]:
results[1].text

u'\nDATA SCIENTIST\n\n\n\n\n        Home Depot\n\n - \n21,054 reviews\n - Atlanta, GA 30354\n\n\n3 years of experience in data mining and statistical analysis. Insights from data to solve business problems....\n\n\n23 days ago window[\'sj_result_ed17685d2e969f9b\'] = {"showSource": false, "source": "The Home Depot", "loggedIn": false, "showMyJobsLinks": true,"undoAction": "unsave","relativeJobAge": "23 days ago","jobKey": "ed17685d2e969f9b", "myIndeedAvailable": true, "tellAFriendEnabled": false, "showMoreActionsLink": false, "resultNumber": 11, "jobStateChangedToSaved": false, "searchState": "q=data scientist&amp;l=Atlanta%2C+GA", "basicPermaLink": "http://www.indeed.com", "saveJobFailed": false, "removeJobFailed": false, "requestPending": false, "notesEnabled": false, "currentPage" : "serp", "sponsored" : true,"sponsorName" : "The Home Depot","reportJobButtonEnabled": false, "showMyJobsHired": false, "showSaveForSponsored": false};\n\n\n\n\nSponsored by The Home Depot\n'

In [53]:
def extract_text(el):
    # return the element's stripped text, or an empty string if the element was not found
    if el:
        return el.text.strip()
    else:
        return ''
extract_text(results[1])

u'DATA SCIENTIST\n\n\n\n\n        Home Depot\n\n - \n21,054 reviews\n - Atlanta, GA 30354\n\n\n3 years of experience in data mining and statistical analysis. Insights from data to solve business problems....\n\n\n23 days ago window[\'sj_result_ed17685d2e969f9b\'] = {"showSource": false, "source": "The Home Depot", "loggedIn": false, "showMyJobsLinks": true,"undoAction": "unsave","relativeJobAge": "23 days ago","jobKey": "ed17685d2e969f9b", "myIndeedAvailable": true, "tellAFriendEnabled": false, "showMoreActionsLink": false, "resultNumber": 11, "jobStateChangedToSaved": false, "searchState": "q=data scientist&amp;l=Atlanta%2C+GA", "basicPermaLink": "http://www.indeed.com", "saveJobFailed": false, "removeJobFailed": false, "requestPending": false, "notesEnabled": false, "currentPage" : "serp", "sponsored" : true,"sponsorName" : "The Home Depot","reportJobButtonEnabled": false, "showMyJobsHired": false, "showSaveForSponsored": false};\n\n\n\n\nSponsored by The Home Depot'

In [54]:
# company
def get_company_from_result(result):
    return extract_text(result.find('span', {'class' : 'company'}))
get_company_from_result(results[1])

u'Home Depot'

In [55]:
# location
def get_location_from_result(result):
    return extract_text(result.find('span', {'class':'location'}))
get_location_from_result(results[1])

u'Atlanta, GA 30354'

In [56]:
# summary
def get_summary_from_result(result):
    return extract_text(result.find('span', {'class':'summary'}))
get_summary_from_result(results[1])

u'3 years of experience in data mining and statistical analysis. Insights from data to solve business problems....'

In [57]:
# title
def get_title_from_result(result):
    return extract_text(result.find('a', {'data-tn-element' : 'jobTitle'}))

get_title_from_result(results[2])

u'Data Scientist'

In [68]:
# get salary if exists
def get_salary_from_result(result):
    salary_table = result.find('td', {'class' : 'snip'})
    if salary_table:
        snip = salary_table.find('nobr')
        if snip:
            return snip.text.strip()   
    return None
get_salary_from_result(results[0])

In [105]:
# get when posted
def get_date_posted(result):
    return extract_text(result.find('span', {'class':'date'}))

# get whether sponsored
def get_is_sponsored(result):
    is_sponsored = extract_text(result.find('span', {'class':'sdn'}))
    if is_sponsored:
        return is_sponsored
    return None

# get number of reviews
def get_number_reviews(result):
    x = extract_text(result.find('span', {'class': 'slNoUnderline'}))
    if x:
        return x
    return None

    

print get_date_posted(results[0])
print get_is_sponsored(results[0])
print get_number_reviews(results[0])

30+ days ago
Sponsored
30 reviews


In [149]:
# get star rating
import re

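def get_star_rating(result):
    find_rating = result.find('span',{'class':'rating'})
    if find_rating:
        search_in = str(find_rating)
        # the rating is encoded as a pixel width; this assumes a full five-star bar is 60px wide
        widths = re.findall(r'width:\s*([0-9.]+)px', search_in)
        if widths:
            return 5*float(widths[0])/60
    return None
get_star_rating(results[8])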


3.55
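
The conversion in `get_star_rating` can be checked by hand. The sketch below assumes, as the division by 60 in that function implies, that a full five-star bar is rendered 60 pixels wide. The sample `style` strings and the name `stars_from_style` are made up for illustration.

```python
import re

FULL_WIDTH_PX = 60.0  # assumed pixel width of a five-star rating bar
MAX_STARS = 5


def stars_from_style(style):
    """Convert a CSS width such as 'width: 42.6px' to a 0-5 star rating."""
    match = re.search(r'width:\s*([0-9.]+)px', style)
    if match is None:
        return None  # no width found, so no rating can be derived
    return MAX_STARS * float(match.group(1)) / FULL_WIDTH_PX


print(stars_from_style('width: 30.0px'))  # half of the bar
print(stars_from_style('no width here'))
```

Under that assumption, a bar that is half filled (30px) maps to 2.5 stars, and a missing width yields `None` rather than an exception.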

In [290]:
job_href = results[0].find('a',{'class' : 'jobtitle turnstileLink'}).attrs['href']
job_href

'/pagead/clk?mo=r&ad=-6NYlbfkN0AbexXlh6WlNaC12RNLKcRQH8fywLm61v9KQllly0vTVrm9U0Iy0AOsYwOq9YOpDX03iprvWHw_SY6xCXG90mwLvOd8fb5BdJ-fu_-2tfp_KoWry1hPm7FaVRyBGPoeYEaNltu7W5i0j-mo3JRbnfv9fjDKHocl-PPaA54t_nU0LuKHYhZrcpw0vHpj47kOqopU6QSmWmYXvVWHNR2CzxLoO9Bbb36lQrm9dkXEzjOFO0F1O8yVPXCq8Wcl22b_eNIa__1cVuYp1K_qe98wAuNTG3gLmeuB2aE7Fqu6Gf2luR5kT_R23Op04FelalPdvo3zVdIfajcqvDI7RmIOMQp4Agc3YAT_Tzhz5WVO5KPp8yXrbTSBwnW8W5Y1Baywe5ZiB7685I_AcrFG5l2N50twz3gceaTSh5P5RH3T_0QSdi7gyR3ungTSaiRr-xDjlbhPXvvcS1iInH8TeXo3X64_&p=1&sk=&fvj=0'

In [292]:
# get link
def get_link(result):
    # prefer the job-title anchor; fall back to the first rel="nofollow" anchor
    link = result.find('a', {'class' : 'jobtitle turnstileLink'}, href=True)
    if link is None:
        link = result.find('a', {'rel' : 'nofollow'}, href=True)
    if link:
        return 'http://www.indeed.com' + link.attrs['href']
    return None
get_link(results[8])

'http://www.indeed.com/rc/clk?jk=cf742a5180d45499&fccid=ddf188c30da34688'

In [99]:
extract_text(results[0].find('span', {'class':'sdn'}))

u'Sponsored'

Now, to scale up our scraping, we need to accumulate more results. We can do this by examining the URL above.

- "http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l=New+York&start=10"

There are two query parameters here we can alter to collect more results: `l=New+York` and `start=10`. The first controls the location of the results (so we can try a different city). The second controls where in the results to start; each page gives 10 results, so we can keep incrementing it by 10 to move further down the list.
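
The pagination scheme described above can be sketched as a small generator. The helper name `build_search_urls` and its `pages` and `per_page` arguments are hypothetical; the template mirrors the URL shown earlier.

```python
try:
    from urllib.parse import quote_plus  # Python 3
except ImportError:
    from urllib import quote_plus        # Python 2

URL_TEMPLATE = 'http://www.indeed.com/jobs?q=data+scientist&l={}&start={}'


def build_search_urls(cities, pages, per_page=10):
    """Yield one search URL per (city, page) pair, stepping `start` by `per_page`."""
    for city in cities:
        for page in range(pages):
            yield URL_TEMPLATE.format(quote_plus(city), page * per_page)


for url in build_search_urls(['New York, NY', 'Atlanta, GA'], pages=2):
    print(url)
```

Unlike a plain `replace(' ', '+')`, `quote_plus` also percent-encodes the comma in the city name.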

#### Complete the following code to collect results from multiple cities and start points. 
- Enter your city below to add it to the search
- Remember to convert your salary to U.S. Dollars to match the other cities if the currency is different

In [188]:
cities = ['Atlanta, GA', 'Washington, DC', 'New York, NY', 'New Orleans, LA', 'Boston, MA', 'Raleigh, NC', 
         'Austin, TX', 'Seattle, WA', 'San Francisco, CA', 'Detroit, MI', 'Minneapolis, MN', 'Cleveland, OH',
         'Denver, CO', 'Kansas City, MO', 'Phoenix, AZ', 'Pittsburgh, PA', 'Columbia, SC', 'Louisville, KY',
         'Indianapolis, IN', 'Winston-Salem, NC', 'Charleston, SC', 'Cincinnati, OH', 'Greenville, SC',
         'Portland, OR', 'Richmond, VA', 'Honolulu, HI', 'Dallas, TX', 'San Diego, CA', 'Charlotte, NC',
          'St. Paul, MN', 'Athens, GA', 'Houston, TX', 'Las Vegas, NV', 'San Jose, CA', 'Sacramento, CA', 
          'Los Angeles, CA']

In [151]:
# create template URL with placeholders for the city and the start offset
url_template = "http://www.indeed.com/jobs?q=data+scientist&l={}&start={}"

In [224]:
# for loop to pull data with bs4
results = []
starts = []
city_state = []
for city in cities:
    city = city.replace(' ', '+')
    for start in range(0,191,10):
        r = requests.get(url_template.format(city, start))
        # Grab the results from the request (as above)
        soup = BeautifulSoup(r.content, "lxml")
        page_results = soup.findAll('div', { "class" : "result" })
        # Append to the full set of results, recording the start offset and city for each one
        results += page_results
        starts.extend([start]*len(page_results))
        city_state.extend([city]*len(page_results))

In [225]:
# checking that lengths match so I can ensure the data is correct when I add it to my dataframe.
print len(results)
print len(starts)
print len(city_state)

8598
8598
8598


#### Use the functions you wrote above to parse out the 4 fields - location, title, company and salary. Create a dataframe from the results with those 4 columns.

In [293]:
# combine data into dictionaries company location summary title salary
rows = []
for i, result in enumerate(results):
    if result:
        row = {'company': get_company_from_result(result),
              'location': get_location_from_result(result),
              'summary': get_summary_from_result(result),
              'title': get_title_from_result(result),
              'salary': get_salary_from_result(result),
              'date_posted': get_date_posted(result),
              'sponsored': get_is_sponsored(result),
              'star_rating': get_star_rating(result),
              'search_city':city_state[i],
              'start': starts[i],
              'website': get_link(result), 
              'number_reviews': get_number_reviews(result)}
        rows.append(row)



In [294]:
# create dataframe
import pandas as pd
ds_jobs = pd.DataFrame(rows)
ds_jobs.shape

(8598, 12)

In [307]:
pd.options.display.max_colwidth = 1000
ds_jobs.head()

Unnamed: 0,company,date_posted,location,number_reviews,salary,search_city,sponsored,star_rating,start,summary,title,website
0,Cotiviti,30+ days ago,"Atlanta, GA",30 reviews,,"Atlanta,+GA",Sponsored,3.35,0,This is a pioneering data scientist who will participate in expanding the new analytics backbone. Cotiviti is looking for an industry leading Data Scientist to...,Data Scientist,http://www.indeed.com/pagead/clk?mo=r&ad=-6NYlbfkN0AbexXlh6WlNaC12RNLKcRQH8fywLm61v9KQllly0vTVrm9U0Iy0AOsYwOq9YOpDX03iprvWHw_SY6xCXG90mwLvOd8fb5BdJ-fu_-2tfp_KoWry1hPm7FaVRyBGPoeYEaNltu7W5i0j-mo3JRbnfv9fjDKHocl-PPaA54t_nU0LuKHYhZrcpw0vHpj47kOqopU6QSmWmYXvVWHNR2CzxLoO9Bbb36lQrm9dkXEzjOFO0F1O8yVPXCq8Wcl22b_eNIa__1cVuYp1K_qe98wAuNTG3gLmeuB2aE7Fqu6Gf2luR5kT_R23Op04FelalPdvo3zVdIfajcqvDI7RmIOMQp4Agc3YAT_Tzhz5WVO5KPp8yXrbTSBwnW8W5Y1Baywe5ZiB7685I_AcrFG5l2N50twz3gceaTSh5P5RH3T_0QSdi7gyR3ungTSaiRr-xDjlbhPXvvcS1iInH8TeXo3X64_&p=1&sk=&fvj=0
1,MobileDev Power,8 days ago,"Atlanta, GA 30305",,,"Atlanta,+GA",Sponsored,,0,"Authoritative quantitative analysis skills, particularly in machine learning, multivariate statistical modeling, and data mining....",Data Scientist With Predictive Modeling,http://www.indeed.com/pagead/clk?mo=r&ad=-6NYlbfkN0ADQeWqYYwuRDGK7FTA-YzU7Qag1HzehP-FB7aWzkBO6FRyVCTNkBclA9FrQyCAXLjn4u9IP3mepDeJnt-BtuBzfosurWUx8dMI4PS2dnzCGDjcb1rqHXIXHxiVvQw9rfF11WZQLdcQkIzgltmPYW2VSpeptqKlQFd-myvhXpnwsBuKpigI5qx1JZoLljXcQZaeCU-bseX3Gb_kmh_cGHBXlvclKzEw6aaFQ4dndrNfOkrV5OxKFWnhb_fOlzQNTxQBttBJJopIwAJefJijrKQXBcgZzB1GaGE40fDWPYVGPtVAC5tdnWXFQumo0kX8h0jq-m-nU1Bh6AAANXMlxN3M3ugo4UeZUSFYJksmWTvbiiI0GykzlsnC0hKZOIoCmE3R8m4yPcgAx6d4sNAe68-U4uAWb2WLQPv2yEfU_OJETV32j8w3Q_-LosbFT-kjsfJ5Wg5FzCL3Fw==&p=2&sk=&fvj=1
2,Predictive Science,7 days ago,United States,,,"Atlanta,+GA",Sponsored,,0,This is a freelance data scientist position who will work with other senior data scientists to consult with executives and data scientists who are a part of the...,Data Scientist,http://www.indeed.com/pagead/clk?mo=r&ad=-6NYlbfkN0Bcgobz-t7weSGYTjoRWMIqcV9t0akB45B3k8NFJVZb5kkFpxn8bhfjltbVDjvx0bn0noTLqpSBW77839Ubfh4eL2ZvByfFlPegJkfgxbap4qbKzQXknClXAKQmaD18uRE6WyuXyxO_e8fzaqQSa7Zpn8Jr1yFQDtObTPJBXLI6XTwEbrzAcGrbjl0uwxUgOs3LUSSvxe-nS1s40s38EGLQgLYWQofqP_lYeeyVtdUDrrPJTJpGfp7cFRFtI_9pnC27Aw1m_ZjCo0jNx_KBHrNx3CIU2NvWCRoWJuX_JD-GV-0D9pYILfNYJitQNSic3Kbc0WlhfLBHxGbWmo8Jtbi0lkfs6KdyiXQj4mL5u4rhBXgCO-DoAn7q_N1DszG4wT7E74K_F3vserP3JdIzBcV01VNW0Q3uZWY=&p=3&sk=&fvj=0
3,"Vision3 Solutions, Inc",22 hours ago,"Atlanta, GA",,$90 an hour,"Atlanta,+GA",,,0,Data Scientist- Big Data*. May write code to automate reports and templates and consolidate data into reports and knowledge. 12 Months Contract*....,Data Scientist- Big Data,"http://www.indeed.com/cmp/Vision3-Solutions,-Inc/jobs/Data-Scientist-1a8c086f5f6f1294?r=1&fccid=37d33d67f3fba52b"
4,FraudScope,1 day ago,"Atlanta, GA",,,"Atlanta,+GA",,,0,Experience with healthcare-related data and familiarity with current methods applied to healthcare data is preferred....,Data Scientist,http://www.indeed.com/cmp/FraudScope/jobs/Data-Scientist-d72c337465398caf?r=1&fccid=e87f46501099545c


In [296]:
from IPython.display import Audio
Audio(url="http://www.soundjay.com/button/beep-01a.mp3",autoplay=True)

In [308]:
# looking at sponsored posts to see if they account for the majority of my duplicates
print 'before drop', ds_jobs.sponsored.value_counts()
print '\ndrop size', ds_jobs.drop_duplicates([x for x in ds_jobs.columns if x != 'start' and x != 'website']).shape
print '\nafter drop', ds_jobs.drop_duplicates([x for x in ds_jobs.columns if x != 'start' and x != 'website']).sponsored.value_counts()

before drop Sponsored                               2399
Sponsored by Amazon.com                  860
Sponsored by Target Corporation           40
Sponsored by Sealed Air                   40
Ad: Urgently Hiring                       22
Sponsored by Total Quality Logistics      20
Sponsored by Altria                       20
Sponsored by The Home Depot               20
Sponsored by Crowe Horwath LLP            20
Urgently Hiring                           18
Sponsored by totaljobs                    12
Sponsored by Ally Financial                1
Name: sponsored, dtype: int64

drop size (2198, 12)

after drop Sponsored                               141
Sponsored by Amazon.com                  77
Sponsored by totaljobs                    3
Sponsored by Sealed Air                   2
Ad: Urgently Hiring                       2
Urgently Hiring                           2
Sponsored by Target Corporation           2
Sponsored by Crowe Horwath LLP            1
Sponsored by Altria             

In [309]:
# making a copy so I can refer to original if needed without having to re-run
ds_jobs_clean = ds_jobs.copy()

# drop duplicates (ignoring 'start' and 'website', which can differ between repeat appearances of the same listing)
ds_jobs_clean.drop_duplicates([x for x in ds_jobs.columns if x != 'start' and x != 'website'], inplace=True)

In [310]:
ds_jobs_clean.head()

Unnamed: 0,company,date_posted,location,number_reviews,salary,search_city,sponsored,star_rating,start,summary,title,website
0,Cotiviti,30+ days ago,"Atlanta, GA",30 reviews,,"Atlanta,+GA",Sponsored,3.35,0,This is a pioneering data scientist who will participate in expanding the new analytics backbone. Cotiviti is looking for an industry leading Data Scientist to...,Data Scientist,http://www.indeed.com/pagead/clk?mo=r&ad=-6NYlbfkN0AbexXlh6WlNaC12RNLKcRQH8fywLm61v9KQllly0vTVrm9U0Iy0AOsYwOq9YOpDX03iprvWHw_SY6xCXG90mwLvOd8fb5BdJ-fu_-2tfp_KoWry1hPm7FaVRyBGPoeYEaNltu7W5i0j-mo3JRbnfv9fjDKHocl-PPaA54t_nU0LuKHYhZrcpw0vHpj47kOqopU6QSmWmYXvVWHNR2CzxLoO9Bbb36lQrm9dkXEzjOFO0F1O8yVPXCq8Wcl22b_eNIa__1cVuYp1K_qe98wAuNTG3gLmeuB2aE7Fqu6Gf2luR5kT_R23Op04FelalPdvo3zVdIfajcqvDI7RmIOMQp4Agc3YAT_Tzhz5WVO5KPp8yXrbTSBwnW8W5Y1Baywe5ZiB7685I_AcrFG5l2N50twz3gceaTSh5P5RH3T_0QSdi7gyR3ungTSaiRr-xDjlbhPXvvcS1iInH8TeXo3X64_&p=1&sk=&fvj=0
1,MobileDev Power,8 days ago,"Atlanta, GA 30305",,,"Atlanta,+GA",Sponsored,,0,"Authoritative quantitative analysis skills, particularly in machine learning, multivariate statistical modeling, and data mining....",Data Scientist With Predictive Modeling,http://www.indeed.com/pagead/clk?mo=r&ad=-6NYlbfkN0ADQeWqYYwuRDGK7FTA-YzU7Qag1HzehP-FB7aWzkBO6FRyVCTNkBclA9FrQyCAXLjn4u9IP3mepDeJnt-BtuBzfosurWUx8dMI4PS2dnzCGDjcb1rqHXIXHxiVvQw9rfF11WZQLdcQkIzgltmPYW2VSpeptqKlQFd-myvhXpnwsBuKpigI5qx1JZoLljXcQZaeCU-bseX3Gb_kmh_cGHBXlvclKzEw6aaFQ4dndrNfOkrV5OxKFWnhb_fOlzQNTxQBttBJJopIwAJefJijrKQXBcgZzB1GaGE40fDWPYVGPtVAC5tdnWXFQumo0kX8h0jq-m-nU1Bh6AAANXMlxN3M3ugo4UeZUSFYJksmWTvbiiI0GykzlsnC0hKZOIoCmE3R8m4yPcgAx6d4sNAe68-U4uAWb2WLQPv2yEfU_OJETV32j8w3Q_-LosbFT-kjsfJ5Wg5FzCL3Fw==&p=2&sk=&fvj=1
2,Predictive Science,7 days ago,United States,,,"Atlanta,+GA",Sponsored,,0,This is a freelance data scientist position who will work with other senior data scientists to consult with executives and data scientists who are a part of the...,Data Scientist,http://www.indeed.com/pagead/clk?mo=r&ad=-6NYlbfkN0Bcgobz-t7weSGYTjoRWMIqcV9t0akB45B3k8NFJVZb5kkFpxn8bhfjltbVDjvx0bn0noTLqpSBW77839Ubfh4eL2ZvByfFlPegJkfgxbap4qbKzQXknClXAKQmaD18uRE6WyuXyxO_e8fzaqQSa7Zpn8Jr1yFQDtObTPJBXLI6XTwEbrzAcGrbjl0uwxUgOs3LUSSvxe-nS1s40s38EGLQgLYWQofqP_lYeeyVtdUDrrPJTJpGfp7cFRFtI_9pnC27Aw1m_ZjCo0jNx_KBHrNx3CIU2NvWCRoWJuX_JD-GV-0D9pYILfNYJitQNSic3Kbc0WlhfLBHxGbWmo8Jtbi0lkfs6KdyiXQj4mL5u4rhBXgCO-DoAn7q_N1DszG4wT7E74K_F3vserP3JdIzBcV01VNW0Q3uZWY=&p=3&sk=&fvj=0
3,"Vision3 Solutions, Inc",22 hours ago,"Atlanta, GA",,$90 an hour,"Atlanta,+GA",,,0,Data Scientist- Big Data*. May write code to automate reports and templates and consolidate data into reports and knowledge. 12 Months Contract*....,Data Scientist- Big Data,"http://www.indeed.com/cmp/Vision3-Solutions,-Inc/jobs/Data-Scientist-1a8c086f5f6f1294?r=1&fccid=37d33d67f3fba52b"
4,FraudScope,1 day ago,"Atlanta, GA",,,"Atlanta,+GA",,,0,Experience with healthcare-related data and familiarity with current methods applied to healthcare data is preferred....,Data Scientist,http://www.indeed.com/cmp/FraudScope/jobs/Data-Scientist-d72c337465398caf?r=1&fccid=e87f46501099545c


Lastly, we need to clean up salary data. 

1. Only a small number of the scraped results have salary information - only these will be used for modeling.
1. Some of the salaries are not yearly but hourly or weekly; these will not be useful to us for now.
1. Some of the entries may be duplicated
1. The salaries are given as text and usually with ranges.
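
A minimal sketch of the filtering and de-duplication steps on plain dictionaries like the scraped rows. The helper name `keep_unique_yearly`, the choice of company, title, and salary as the duplicate key, and the sample rows are assumptions for illustration; the notebook itself works with a pandas dataframe.

```python
def keep_unique_yearly(rows):
    """Keep rows whose salary is quoted per year, dropping repeated listings."""
    seen = set()
    kept = []
    for row in rows:
        salary = row.get('salary')
        if not salary or 'a year' not in salary:
            continue  # no salary, or pay quoted per hour/week/month
        key = (row.get('company'), row.get('title'), salary)
        if key in seen:
            continue  # same listing already kept
        seen.add(key)
        kept.append(row)
    return kept


sample_rows = [
    {'company': 'A', 'title': 'Data Scientist', 'salary': '$130,000 a year'},
    {'company': 'A', 'title': 'Data Scientist', 'salary': '$130,000 a year'},
    {'company': 'B', 'title': 'Analyst', 'salary': '$25 an hour'},
    {'company': 'C', 'title': 'Data Engineer', 'salary': None},
]
print(len(keep_unique_yearly(sample_rows)))
```

Of the four sample rows, only the first survives: the second is a repeat, the third is hourly, and the fourth has no salary.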

#### Find the entries with annual salaries by filtering out entries without salaries or with salaries that are not yearly (those that refer to hour or week). Also, remove duplicate entries

In [315]:
pd.options.display.max_rows = 200
ds_jobs_clean.salary.value_counts()

$80,000 - $100,000 a year     4
$130,000 a year               4
$140,000 a year               3
$20.65 an hour                3
$50,000 a year                3
$15 an hour                   3
$110,000 a year               2
$6,250 - $10,833 a month      2
$125,000 a year               2
$75,000 - $90,000 a year      2
$90,000 - $140,000 a year     2
$80,000 a year                2
$150,000 a year               2
$45,000 a year                2
$5,400 - $6,500 a month       2
$12 an hour                   2
$25 an hour                   2
$24,190 - $60,588 a year      2
$8.25 an hour                 2
$50,000 - $70,000 a year      2
$75,000 - $120,000 a year     2
$39,000 - $42,000 a year      1
$5,541 a month                1
$165,000 a year               1
$27.72 an hour                1
$74,260 - $96,538 a year      1
$45,000 - $55,000 a year      1
$90 an hour                   1
$6,667 a month                1
$40,000 - $50,000 a year      1
$13 an hour                   1
$135,000

In [338]:
import re
import numpy as np
def get_standardized_salary(salary_string):
    # convert a salary string such as '$20,000 - $50,000 a year' to an annual figure;
    # a range is replaced by its midpoint; returns None if no salary or pay period is found
    if not salary_string:
        return None
    # multipliers assume 12 months, 52 weeks, 5-day weeks and 8-hour days
    multipliers = [('a year', 1), ('a month', 12), ('a week', 52),
                   ('a day', 5*52), ('an hour', 8*5*52)]
    # numbers with optional thousands separators and decimals, e.g. 20,000 or 17.79
    matches = re.findall(r'[0-9][0-9,]*(?:\.[0-9]+)?', salary_string)
    amounts = [float(salary.replace(',', '')) for salary in matches]
    for unit, multiplier in multipliers:
        if unit in salary_string and amounts:
            return np.mean(amounts)*multiplier
    return None
def get_how_paid(salary_string):
    # label the pay period named in the salary string; None if missing or unrecognized
    if not salary_string:
        return None
    periods = [('a year', 'yearly'), ('a month', 'monthly'), ('a week', 'weekly'),
               ('a day', 'daily'), ('an hour', 'hourly')]
    for unit, label in periods:
        if unit in salary_string:
            return label
    return None
# checking if it works on multiple types
print get_standardized_salary(None) 
print get_standardized_salary('20,000 to 50,000 a year')
print get_standardized_salary('$20,000 a year')
print get_standardized_salary('$20,000 a year')
print '\n'
print get_how_paid('$125,000 - $150,000 a year')
print get_how_paid('$17.79 an hour')
print get_how_paid('$2,000 - $2,889 a month')
print get_how_paid('$2000 a week')

None
35000.0
nan


yearly
hourly
monthly
weekly




In [330]:
ds_jobs_clean['how_paid'] = ds_jobs_clean.salary.apply(get_how_paid)
ds_jobs_clean['annual_salary'] = ds_jobs_clean.salary.apply(get_standardized_salary)
ds_jobs_clean[['salary', 'how_paid', 'annual_salary']][ds_jobs_clean.salary.notnull()].sample(frac=1).head(20)

Unnamed: 0,salary,how_paid,annual_salary
383,"$125,000 - $150,000 a year",yearly,137500.0
1288,"$120,000 a year",yearly,120000.0
3178,"$125,000 a year",yearly,125000.0
1835,"$4,584 a month",monthly,55008.0
2996,"$50,000 a year",yearly,50000.0
1322,"$130,000 a year",yearly,130000.0
1896,"$90,000 a year",yearly,90000.0
4196,"$150,000 a year",yearly,150000.0
3196,$17.79 an hour,hourly,37003.2
4193,"$2,000 - $2,889 a month",monthly,29334.0
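
The annualized figures above follow from fixed multipliers. As a quick check, assuming an 8-hour day, a 5-day week, 52 paid weeks, and 12 months per year (the multipliers used in `get_standardized_salary`), the hourly and monthly rows can be reproduced by hand. The constant and variable names are made up for illustration.

```python
HOURS_PER_YEAR = 8 * 5 * 52   # 8-hour days, 5-day weeks, 52 weeks = 2080 hours
MONTHS_PER_YEAR = 12

# '$17.79 an hour'
hourly_annual = 17.79 * HOURS_PER_YEAR

# '$2,000 - $2,889 a month': a range is represented by its midpoint before scaling
monthly_annual = (2000 + 2889) / 2.0 * MONTHS_PER_YEAR

print(round(hourly_annual, 2))
print(monthly_annual)
```

These multipliers treat every listing as full-time, year-round work, which likely overstates annual pay for short contracts quoted by the hour.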


In [341]:
# checking that none of my salary info got missed
print ds_jobs_clean[(ds_jobs_clean.salary.notnull()) & (ds_jobs_clean.annual_salary.isnull())].shape[0]
print ds_jobs_clean[(ds_jobs_clean.salary.notnull()) & (ds_jobs_clean.how_paid.isnull())].shape[0]

0
0


In [357]:
[1 if x else 0 for x in ds_jobs_clean.sponsored.values ]

[1,
 1,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 1,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,


In [365]:
ds_jobs_clean.search_city = ds_jobs_clean.search_city.str.replace('+' ,' ')
ds_jobs_clean.number_reviews = pd.to_numeric(ds_jobs_clean.number_reviews.str.replace(' reviews' ,'').str.replace(',' ,''))
ds_jobs_clean.head()

Unnamed: 0,company,date_posted,location,number_reviews,salary,search_city,sponsored,star_rating,start,summary,title,website,how_paid,annual_salary,is_sponsored
0,Cotiviti,30+ days ago,"Atlanta, GA",30.0,,"Atlanta, GA",Sponsored,3.35,0,This is a pioneering data scientist who will participate in expanding the new analytics backbone. Cotiviti is looking for an industry leading Data Scientist to...,Data Scientist,http://www.indeed.com/pagead/clk?mo=r&ad=-6NYlbfkN0AbexXlh6WlNaC12RNLKcRQH8fywLm61v9KQllly0vTVrm9U0Iy0AOsYwOq9YOpDX03iprvWHw_SY6xCXG90mwLvOd8fb5BdJ-fu_-2tfp_KoWry1hPm7FaVRyBGPoeYEaNltu7W5i0j-mo3JRbnfv9fjDKHocl-PPaA54t_nU0LuKHYhZrcpw0vHpj47kOqopU6QSmWmYXvVWHNR2CzxLoO9Bbb36lQrm9dkXEzjOFO0F1O8yVPXCq8Wcl22b_eNIa__1cVuYp1K_qe98wAuNTG3gLmeuB2aE7Fqu6Gf2luR5kT_R23Op04FelalPdvo3zVdIfajcqvDI7RmIOMQp4Agc3YAT_Tzhz5WVO5KPp8yXrbTSBwnW8W5Y1Baywe5ZiB7685I_AcrFG5l2N50twz3gceaTSh5P5RH3T_0QSdi7gyR3ungTSaiRr-xDjlbhPXvvcS1iInH8TeXo3X64_&p=1&sk=&fvj=0,,,1
1,MobileDev Power,8 days ago,"Atlanta, GA 30305",,,"Atlanta, GA",Sponsored,,0,"Authoritative quantitative analysis skills, particularly in machine learning, multivariate statistical modeling, and data mining....",Data Scientist With Predictive Modeling,http://www.indeed.com/pagead/clk?mo=r&ad=-6NYlbfkN0ADQeWqYYwuRDGK7FTA-YzU7Qag1HzehP-FB7aWzkBO6FRyVCTNkBclA9FrQyCAXLjn4u9IP3mepDeJnt-BtuBzfosurWUx8dMI4PS2dnzCGDjcb1rqHXIXHxiVvQw9rfF11WZQLdcQkIzgltmPYW2VSpeptqKlQFd-myvhXpnwsBuKpigI5qx1JZoLljXcQZaeCU-bseX3Gb_kmh_cGHBXlvclKzEw6aaFQ4dndrNfOkrV5OxKFWnhb_fOlzQNTxQBttBJJopIwAJefJijrKQXBcgZzB1GaGE40fDWPYVGPtVAC5tdnWXFQumo0kX8h0jq-m-nU1Bh6AAANXMlxN3M3ugo4UeZUSFYJksmWTvbiiI0GykzlsnC0hKZOIoCmE3R8m4yPcgAx6d4sNAe68-U4uAWb2WLQPv2yEfU_OJETV32j8w3Q_-LosbFT-kjsfJ5Wg5FzCL3Fw==&p=2&sk=&fvj=1,,,1
2,Predictive Science,7 days ago,United States,,,"Atlanta, GA",Sponsored,,0,This is a freelance data scientist position who will work with other senior data scientists to consult with executives and data scientists who are a part of the...,Data Scientist,http://www.indeed.com/pagead/clk?mo=r&ad=-6NYlbfkN0Bcgobz-t7weSGYTjoRWMIqcV9t0akB45B3k8NFJVZb5kkFpxn8bhfjltbVDjvx0bn0noTLqpSBW77839Ubfh4eL2ZvByfFlPegJkfgxbap4qbKzQXknClXAKQmaD18uRE6WyuXyxO_e8fzaqQSa7Zpn8Jr1yFQDtObTPJBXLI6XTwEbrzAcGrbjl0uwxUgOs3LUSSvxe-nS1s40s38EGLQgLYWQofqP_lYeeyVtdUDrrPJTJpGfp7cFRFtI_9pnC27Aw1m_ZjCo0jNx_KBHrNx3CIU2NvWCRoWJuX_JD-GV-0D9pYILfNYJitQNSic3Kbc0WlhfLBHxGbWmo8Jtbi0lkfs6KdyiXQj4mL5u4rhBXgCO-DoAn7q_N1DszG4wT7E74K_F3vserP3JdIzBcV01VNW0Q3uZWY=&p=3&sk=&fvj=0,,,1
3,"Vision3 Solutions, Inc",22 hours ago,"Atlanta, GA",,$90 an hour,"Atlanta, GA",,,0,Data Scientist- Big Data*. May write code to automate reports and templates and consolidate data into reports and knowledge. 12 Months Contract*....,Data Scientist- Big Data,"http://www.indeed.com/cmp/Vision3-Solutions,-Inc/jobs/Data-Scientist-1a8c086f5f6f1294?r=1&fccid=37d33d67f3fba52b",hourly,187200.0,0
4,FraudScope,1 day ago,"Atlanta, GA",,,"Atlanta, GA",,,0,Experience with healthcare-related data and familiarity with current methods applied to healthcare data is preferred....,Data Scientist,http://www.indeed.com/cmp/FraudScope/jobs/Data-Scientist-d72c337465398caf?r=1&fccid=e87f46501099545c,,,0


In [364]:
ds_jobs_clean['is_sponsored'] = ds_jobs_clean.sponsored.apply(lambda x: 1 if x else 0)
ds_jobs_clean[['sponsored', 'is_sponsored']].head(20)

Unnamed: 0,sponsored,is_sponsored
0,Sponsored,1
1,Sponsored,1
2,Sponsored,1
3,,0
4,,0
5,,0
6,,0
7,,0
8,,0
9,,0


In [373]:
# investigating date_posted column to figure out how to best convert, what my categories should be
ds_jobs_clean.date_posted.value_counts()

30+ days ago      868
7 days ago         82
6 days ago         78
2 days ago         75
1 day ago          75
14 days ago        64
13 days ago        62
20 days ago        62
9 days ago         57
8 days ago         55
5 days ago         55
16 days ago        47
15 days ago        45
12 days ago        45
19 days ago        40
29 days ago        39
22 days ago        39
21 days ago        38
26 days ago        35
27 days ago        29
23 days ago        26
30 days ago        26
28 days ago        25
3 days ago         21
25 days ago        19
11 days ago        16
24 days ago        15
17 days ago        14
10 days ago        13
4 days ago         13
20 hours ago       10
23 hours ago        9
18 hours ago        9
11 hours ago        9
10 hours ago        8
19 hours ago        8
13 hours ago        7
21 hours ago        7
17 hours ago        7
5 hours ago         6
9 hours ago         6
1 hour ago          6
16 hours ago        5
18 days ago         5
14 hours ago        3
8 hours ag

In [446]:
def categorize_time_since_posted(time_since_posted_str):
    categories = ['in the last day', '1-6 days ago', '7-12 days ago', '13-18 days ago', '19-24 days ago', '25-30 days ago', 'more than 30 days ago']
    # 'minute' also matches 'minutes', so '1 minute ago' is handled too
    if re.search(r'hour|minute', time_since_posted_str):
        return categories[0]
    elif re.search(r'30\+', time_since_posted_str):
        return categories[-1]
    else:
        days_ago = int(re.findall(r'(\d+) day', time_since_posted_str)[0])
        if days_ago <= 6:
            return categories[1]
        elif days_ago <= 12:
            return categories[2]
        elif days_ago <= 18:
            return categories[3]
        elif days_ago <= 24:
            return categories[4]
        else:
            return categories[5]
# test a few
print(categorize_time_since_posted('39 minutes ago'))
print(categorize_time_since_posted('17 hours ago'))
print(categorize_time_since_posted('9 days ago'))
print(categorize_time_since_posted('4 days ago'))
print(categorize_time_since_posted('14 days ago'))
print(categorize_time_since_posted('30 days ago'))
print(categorize_time_since_posted('30+ days ago'))

in the last day
in the last day
7-12 days ago
1-6 days ago
13-18 days ago
25-30 days ago
more than 30 days ago


In [448]:
# checking how many end up in each category - looks ok
ds_jobs_clean.date_posted.apply(categorize_time_since_posted).value_counts()

more than 30 days ago    868
1-6 days ago             317
7-12 days ago            268
13-18 days ago           237
19-24 days ago           220
25-30 days ago           173
in the last day          115
Name: date_posted, dtype: int64

In [480]:
ds_jobs_clean['time_since_posted'] = ds_jobs_clean.date_posted.apply(categorize_time_since_posted)

In [464]:
# flag listings whose location contains the search city (some locations add a zip code or neighborhood, so match with a regex)
def in_city_proper(search_city, location):
    # re.escape keeps any regex metacharacters in the city name literal
    if re.search(re.escape(search_city), location):
        return 1
    else:
        return 0
ds_jobs_clean['in_city'] = ds_jobs_clean.apply(lambda x: in_city_proper(x.search_city, x.location), axis=1)

In [466]:
# checking if it worked well
ds_jobs_clean[['search_city', 'location', 'in_city']].sample(frac=1)

Unnamed: 0,search_city,location,in_city
2900,"Detroit, MI",United States,0
2608,"San Francisco, CA","San Francisco, CA 94107 (South Of Market area)",1
1350,"Boston, MA","Boston, MA",1
1788,"Raleigh, NC","Durham, NC 27703",0
2367,"Seattle, WA","Seattle, WA",1
724,"New York, NY","New York, NY",1
3655,"Denver, CO","Denver, CO",1
1505,"Raleigh, NC","Raleigh, NC",1
8219,"San Diego, CA","San Diego, CA",1
6962,"Portland, OR","Portland, OR",1


In [475]:
# convert the `start` offset (0, 10, ..., 190) to a 1-based page number

def convert_to_page(start_value):
    # offsets step by 10, so start=0 is page 1, start=10 is page 2, and so on
    return start_value // 10 + 1
# test
print(convert_to_page(0))
print(convert_to_page(10))
print(convert_to_page(30))

1
2
4


In [477]:
ds_jobs_clean['page'] = ds_jobs_clean.start.apply(convert_to_page)
ds_jobs_clean[['start', 'page']].sample(frac=1)

Unnamed: 0,start,page
2668,180,19
6244,0,1
1995,130,14
2744,30,4
1195,0,1
412,70,8
1830,20,3
1318,80,9
4015,80,9
492,120,13


In [467]:
ds_jobs_clean.columns

Index([u'company', u'date_posted', u'location', u'number_reviews', u'salary',
       u'search_city', u'sponsored', u'star_rating', u'start', u'summary',
       u'title', u'website', u'how_paid', u'annual_salary', u'is_sponsored',
       u'time_since_posted', u'in_city'],
      dtype='object')

In [478]:
col = [x for x in ds_jobs_clean.columns if x not in ['location', 'salary', 'sponsored', 'start', 'date_posted']]

In [481]:
# save scraped results as a CSV for Tableau/external viz
ds_jobs_clean[col].to_csv('data_science_jobs.csv', encoding='utf-8')

## Predicting salaries using Logistic Regression

In [483]:
# load in the data of scraped salaries
jobs = pd.read_csv('data_science_jobs.csv')
jobs.head()

Unnamed: 0.1,Unnamed: 0,company,number_reviews,search_city,star_rating,summary,title,website,how_paid,annual_salary,is_sponsored,time_since_posted,in_city,page
0,0,Cotiviti,30.0,"Atlanta, GA",3.35,This is a pioneering data scientist who will participate in expanding the new analytics backbone. Cotiviti is looking for an industry leading Data Scientist to...,Data Scientist,http://www.indeed.com/pagead/clk?mo=r&ad=-6NYlbfkN0AbexXlh6WlNaC12RNLKcRQH8fywLm61v9KQllly0vTVrm9U0Iy0AOsYwOq9YOpDX03iprvWHw_SY6xCXG90mwLvOd8fb5BdJ-fu_-2tfp_KoWry1hPm7FaVRyBGPoeYEaNltu7W5i0j-mo3JRbnfv9fjDKHocl-PPaA54t_nU0LuKHYhZrcpw0vHpj47kOqopU6QSmWmYXvVWHNR2CzxLoO9Bbb36lQrm9dkXEzjOFO0F1O8yVPXCq8Wcl22b_eNIa__1cVuYp1K_qe98wAuNTG3gLmeuB2aE7Fqu6Gf2luR5kT_R23Op04FelalPdvo3zVdIfajcqvDI7RmIOMQp4Agc3YAT_Tzhz5WVO5KPp8yXrbTSBwnW8W5Y1Baywe5ZiB7685I_AcrFG5l2N50twz3gceaTSh5P5RH3T_0QSdi7gyR3ungTSaiRr-xDjlbhPXvvcS1iInH8TeXo3X64_&p=1&sk=&fvj=0,,,1,more than 30 days ago,1,1
1,1,MobileDev Power,,"Atlanta, GA",,"Authoritative quantitative analysis skills, particularly in machine learning, multivariate statistical modeling, and data mining....",Data Scientist With Predictive Modeling,http://www.indeed.com/pagead/clk?mo=r&ad=-6NYlbfkN0ADQeWqYYwuRDGK7FTA-YzU7Qag1HzehP-FB7aWzkBO6FRyVCTNkBclA9FrQyCAXLjn4u9IP3mepDeJnt-BtuBzfosurWUx8dMI4PS2dnzCGDjcb1rqHXIXHxiVvQw9rfF11WZQLdcQkIzgltmPYW2VSpeptqKlQFd-myvhXpnwsBuKpigI5qx1JZoLljXcQZaeCU-bseX3Gb_kmh_cGHBXlvclKzEw6aaFQ4dndrNfOkrV5OxKFWnhb_fOlzQNTxQBttBJJopIwAJefJijrKQXBcgZzB1GaGE40fDWPYVGPtVAC5tdnWXFQumo0kX8h0jq-m-nU1Bh6AAANXMlxN3M3ugo4UeZUSFYJksmWTvbiiI0GykzlsnC0hKZOIoCmE3R8m4yPcgAx6d4sNAe68-U4uAWb2WLQPv2yEfU_OJETV32j8w3Q_-LosbFT-kjsfJ5Wg5FzCL3Fw==&p=2&sk=&fvj=1,,,1,7-12 days ago,1,1
2,2,Predictive Science,,"Atlanta, GA",,This is a freelance data scientist position who will work with other senior data scientists to consult with executives and data scientists who are a part of the...,Data Scientist,http://www.indeed.com/pagead/clk?mo=r&ad=-6NYlbfkN0Bcgobz-t7weSGYTjoRWMIqcV9t0akB45B3k8NFJVZb5kkFpxn8bhfjltbVDjvx0bn0noTLqpSBW77839Ubfh4eL2ZvByfFlPegJkfgxbap4qbKzQXknClXAKQmaD18uRE6WyuXyxO_e8fzaqQSa7Zpn8Jr1yFQDtObTPJBXLI6XTwEbrzAcGrbjl0uwxUgOs3LUSSvxe-nS1s40s38EGLQgLYWQofqP_lYeeyVtdUDrrPJTJpGfp7cFRFtI_9pnC27Aw1m_ZjCo0jNx_KBHrNx3CIU2NvWCRoWJuX_JD-GV-0D9pYILfNYJitQNSic3Kbc0WlhfLBHxGbWmo8Jtbi0lkfs6KdyiXQj4mL5u4rhBXgCO-DoAn7q_N1DszG4wT7E74K_F3vserP3JdIzBcV01VNW0Q3uZWY=&p=3&sk=&fvj=0,,,1,7-12 days ago,0,1
3,3,"Vision3 Solutions, Inc",,"Atlanta, GA",,Data Scientist- Big Data*. May write code to automate reports and templates and consolidate data into reports and knowledge. 12 Months Contract*....,Data Scientist- Big Data,"http://www.indeed.com/cmp/Vision3-Solutions,-Inc/jobs/Data-Scientist-1a8c086f5f6f1294?r=1&fccid=37d33d67f3fba52b",hourly,187200.0,0,in the last day,1,1
4,4,FraudScope,,"Atlanta, GA",,Experience with healthcare-related data and familiarity with current methods applied to healthcare data is preferred....,Data Scientist,http://www.indeed.com/cmp/FraudScope/jobs/Data-Scientist-d72c337465398caf?r=1&fccid=e87f46501099545c,,,0,1-6 days ago,1,1


In [487]:
jobs_w_salary = jobs[jobs.annual_salary.notnull()]
jobs_w_salary.shape

(149, 14)

#### We want to predict a binary variable - whether the salary was low or high. Compute the median salary and create a new binary variable that is true when the salary is high (above the median)

In [489]:
# calculate the median once and create a feature with 1 for a high salary (above the median)

median_salary = jobs_w_salary.annual_salary.median()
jobs_w_salary['high_paid'] = (jobs_w_salary.annual_salary > median_salary).astype(int)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  app.launch_new_instance()


Unnamed: 0.1,Unnamed: 0,company,number_reviews,search_city,star_rating,summary,title,website,how_paid,annual_salary,is_sponsored,time_since_posted,in_city,page,high_paid
3,3,"Vision3 Solutions, Inc",,"Atlanta, GA",,Data Scientist- Big Data*. May write code to automate reports and templates and consolidate data into reports and knowledge. 12 Months Contract*....,Data Scientist- Big Data,"http://www.indeed.com/cmp/Vision3-Solutions,-Inc/jobs/Data-Scientist-1a8c086f5f6f1294?r=1&fccid=37d33d67f3fba52b",hourly,187200.0,0,in the last day,1,1,1
12,12,Centers for Disease Control and Prevention,66.0,"Atlanta, GA",4.5,"Whether we are protecting the American people from public health threats, researching emerging diseases, or mobilizing public health programs with our domestic...",Statistician (Health),http://www.indeed.com/rc/clk?jk=1ae9fbf052162ca0&fccid=3e901f592b439cea,yearly,101553.5,0,1-6 days ago,1,1,1
33,56,Analytic Recruiting,2.0,"Atlanta, GA",4.25,"Experience with Teradata SQL, MS SQL server Data Visualization (e.g., Tableau or other), Access, Excel, Visual Basic,....",Junior Data Scientist,http://www.indeed.com/rc/clk?jk=78b9859975833fda&fccid=3b4e0f2c2deb87d6,yearly,82500.0,0,19-24 days ago,0,4,1
49,82,Centers for Disease Control and Prevention,66.0,"Atlanta, GA",4.5,"Department of Health and Human Services (DHHS), Centers for Disease Control and Prevention (CDC), Center for Surveillance, Epidemiology and Laboratory (CSELS),...",Health Scientist,http://www.indeed.com/rc/clk?jk=34cd7e560e06b0a0&fccid=3e901f592b439cea,yearly,85399.0,0,7-12 days ago,1,6,1
62,115,Stackfolio,,"Atlanta, GA",,A competitive full-time salary as well as a great options package that will only be available to this early batch of hires....,Lead Data Scientist,http://www.indeed.com/company/Stackfolio/jobs/Lead-Data-Scientist-be3a7444db34e9a3?r=1&fccid=5a5990c5dca07ad0,yearly,80000.0,0,more than 30 days ago,1,8,1
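
The `SettingWithCopyWarning` shown above appears because `jobs_w_salary` was created by filtering `jobs`, so pandas cannot tell whether the new column is meant to reach the original frame. A minimal sketch of one way to avoid it, assuming a toy frame in place of the scraped data (the salary values are made up for illustration): take an explicit `.copy()` after filtering.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the scraped data; the salary values are illustrative only.
jobs = pd.DataFrame({
    'title': ['a', 'b', 'c', 'd', 'e'],
    'annual_salary': [50000.0, np.nan, 90000.0, 70000.0, 120000.0],
})

# .copy() makes jobs_w_salary an independent frame, so adding a column is unambiguous.
jobs_w_salary = jobs[jobs.annual_salary.notnull()].copy()

# Compute the median once, then flag salaries strictly above it.
median_salary = jobs_w_salary.annual_salary.median()
jobs_w_salary['high_paid'] = (jobs_w_salary.annual_salary > median_salary).astype(int)

print(median_salary)
print(jobs_w_salary.high_paid.tolist())
```

The original `jobs` frame is left untouched, which is usually what is wanted when the filtered subset is only used for modeling.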


In [492]:
jobs_w_salary[['high_paid', 'annual_salary']].sample(frac=1)

Unnamed: 0,high_paid,annual_salary
714,0,69996.0
351,0,68908.5
1809,0,70000.0
692,0,71400.0
638,0,52000.0
1601,1,130000.0
1472,0,29234.4
1226,0,32240.0
600,0,42389.0
2005,1,90000.0


In [505]:
pd.pivot_table(jobs_w_salary, index='search_city', values='annual_salary', aggfunc='count')

search_city
Atlanta, GA           8
Austin, TX           20
Boston, MA            6
Charlotte, NC         4
Cincinnati, OH        1
Cleveland, OH         4
Dallas, TX            7
Denver, CO            9
Detroit, MI           4
Honolulu, HI          2
Indianapolis, IN      2
Kansas City, MO       1
Minneapolis, MN       6
New Orleans, LA       2
New York, NY          7
Phoenix, AZ           5
Pittsburgh, PA        4
Portland, OR         10
Raleigh, NC          17
Richmond, VA          2
San Diego, CA         5
San Francisco, CA     9
Seattle, WA           3
Washington, DC       11
Name: annual_salary, dtype: int64

### Q: What is the baseline accuracy for this model?

About 50%. Because the classes are split at the median, roughly half of the salaries fall in each class, so guessing at random (or always predicting the same class) is correct about half the time.
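
The baseline can also be checked directly from the labels. The sketch below uses a hypothetical `high_paid` series rather than the scraped data. Because the flag is set only for salaries strictly above the median, ties at the median fall into the low class, so the split is not always exactly 50/50.

```python
import pandas as pd

# Hypothetical labels: 1 = above the median salary, 0 = at or below it.
high_paid = pd.Series([1, 0, 0, 1, 0, 1, 0, 0])

# Share of each class; the larger share is the accuracy obtained by
# always predicting the majority class.
class_share = high_paid.value_counts(normalize=True)
baseline_accuracy = class_share.max()

print(class_share)
print(baseline_accuracy)
```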

#### Create a Logistic Regression model to predict High/Low salary using statsmodels. Start by ONLY using the location as a feature. Display the coefficients and write a short summary of what they mean.

In [497]:
# create statsmodels model and summary
# (Logit is taken from statsmodels.api, the non-formula interface)
import statsmodels.api as sm
import patsy

In [502]:
print(X)

[[ 1.  0.  0. ...,  0.  0.  0.]
 [ 1.  0.  0. ...,  0.  0.  0.]
 [ 1.  0.  0. ...,  0.  0.  0.]
 ..., 
 [ 1.  0.  0. ...,  0.  0.  0.]
 [ 1.  0.  0. ...,  0.  0.  0.]
 [ 1.  0.  0. ...,  0.  0.  0.]]


In [514]:
y, X = patsy.dmatrices('high_paid ~ search_city', data=jobs_w_salary)
sm.Logit(y, X, method='basinhopping').fit().summary()

         Current function value: 0.404153
         Iterations: 35


LinAlgError: Singular matrix
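
A likely cause of the `Singular matrix` error is that several cities have only one or two listings with salaries (Cincinnati and Kansas City have one each). Their dummy columns can separate the classes perfectly, and the matrix needed for the summary cannot then be inverted. One possible workaround is to pool sparse cities into a single level before building the design matrix. The sketch below uses a hypothetical helper, `collapse_rare_categories`, with an arbitrary threshold of 5 and made-up counts.

```python
import pandas as pd

def collapse_rare_categories(series, min_count=5, other_label='Other'):
    """Replace categories seen fewer than min_count times with other_label."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    # keep values that are not rare; substitute other_label everywhere else
    return series.where(~series.isin(rare), other_label)

# Made-up city column: two well-populated cities and two singletons.
cities = pd.Series(['Austin, TX'] * 6 + ['Raleigh, NC'] * 5 +
                   ['Cincinnati, OH'] + ['Kansas City, MO'])

city_grouped = collapse_rare_categories(cities, min_count=5)
print(city_grouped.value_counts())
```

The grouped column could then replace `search_city` in the patsy formula. Whether the fit converges afterwards still depends on the data.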

#### Create a few new variables in your dataframe to represent interesting features of a job title.
- For example, create a feature that represents whether 'Senior' or 'Manager' is in the title 
- Then build a new Logistic Regression model with these features. Do they add any value? 


In [None]:
# create senior, director, and manager dummies
salary_data['is_senior'] = salary_data['title'].str.contains('Senior').astype(int) # example
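
As a rough sketch of how several title flags might be built at once, the example below loops over a dictionary of keyword patterns. The titles and the patterns are assumptions chosen for illustration, and matching is case-insensitive so that 'senior' and 'Senior' are treated alike.

```python
import pandas as pd

# Hypothetical titles standing in for the scraped `title` column.
salary_data = pd.DataFrame({'title': [
    'Senior Data Scientist',
    'Data Science Manager',
    'Junior Data Scientist',
    'Director of Analytics',
    'Sr. Data Engineer',
]})

# Keyword patterns are assumptions; adjust them to the titles actually scraped.
title_patterns = {
    'is_senior': r'\b(?:senior|sr|lead)\b',
    'is_manager': r'\bmanager\b',
    'is_director': r'\bdirector\b',
}

for column, pattern in title_patterns.items():
    # str.contains returns booleans; cast to 0/1 for use as model features
    salary_data[column] = salary_data['title'].str.contains(
        pattern, case=False, regex=True).astype(int)

print(salary_data)
```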


#### Rebuild this model with scikit-learn.
- You can either create the dummy features manually or use the `dmatrix` function from `patsy`
- Remember to scale the feature variables as well!


In [None]:
# scale, (patsy optional), and fit
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from patsy import dmatrix

scaler = StandardScaler()
model = LogisticRegression(penalty = 'l2', C=0.1)
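
One way the scale-then-fit step might look is sketched below on a tiny made-up frame. It builds the dummies with `pd.get_dummies` instead of patsy, which is a substitution made here for brevity, and relies on the default L2 penalty of `LogisticRegression`.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Made-up rows standing in for the listings that have salaries.
toy = pd.DataFrame({
    'search_city': ['Austin, TX', 'Raleigh, NC', 'Austin, TX',
                    'Raleigh, NC', 'Austin, TX', 'Raleigh, NC'],
    'is_senior': [1, 0, 1, 0, 0, 1],
    'high_paid': [1, 0, 1, 0, 0, 1],
})

# One dummy column per city, dropping the first level as the reference.
X = pd.get_dummies(toy[['search_city', 'is_senior']],
                   columns=['search_city'], drop_first=True).astype(float)
y = toy['high_paid']

# Standardize so the penalty treats all features on the same scale.
X_scaled = StandardScaler().fit_transform(X)

# L2 is the default penalty; C=0.1 matches the value used above.
model = LogisticRegression(C=0.1)
model.fit(X_scaled, y)

print(dict(zip(X.columns, model.coef_[0])))
```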


#### Use cross-validation in scikit-learn to evaluate the model above. 
- Evaluate the accuracy, AUC, precision and recall of the model. 
- Discuss the differences and explain when you want a high-recall or a high-precision model in this scenario.
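
A minimal sketch of the evaluation loop follows, using synthetic data from `make_classification` in place of the salary features. Wrapping the scaler and the model in a pipeline is a design choice made here, not something the assignment requires: it means the scaler is refit on each training fold, so the held-out fold does not influence the scaling.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary data standing in for the salary features.
X, y = make_classification(n_samples=150, n_features=8,
                           n_informative=4, random_state=0)

# The pipeline scales and fits inside each cross-validation fold.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(C=0.1))

results = {}
for metric in ['accuracy', 'precision', 'recall', 'roc_auc']:
    scores = cross_val_score(pipeline, X, y, cv=3, scoring=metric)
    results[metric] = scores.mean()
    print(metric, round(scores.mean(), 3), round(scores.std(), 3))
```

When reading the results, precision answers "of the jobs predicted high paid, how many really are?", while recall answers "of the jobs that are high paid, how many were found?".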

In [None]:
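from sklearn.model_selection import cross_val_score

for metric in ['accuracy', 'precision', 'recall', 'roc_auc']: # example
    scores = cross_val_score(model, X_scaled, y, cv=3, scoring=metric)
    print(metric, scores.mean(), scores.std())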

### Compare L1 and L2 regularization for this logistic regression model. What effect does this have on the coefficients learned?

In [None]:
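# liblinear is one of the solvers that supports the L1 penalty
model = LogisticRegression(penalty = 'l1', C=1.0, solver='liblinear')

for metric in ['accuracy', 'precision', 'recall', 'roc_auc']:
    scores = cross_val_score(model, X_scaled, y, cv=3, scoring=metric)
    print(metric, scores.mean(), scores.std())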

In [None]:
model.fit(X_scaled, y)

df = pd.DataFrame({'features' : X.design_info.column_names, 'coef': model.coef_[0,:]})
df.sort_values('coef', ascending=False, inplace=True)
df
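
To see the effect of the penalty on the coefficients, one option is to fit both variants on the same data and count the coefficients that are exactly zero. The sketch below uses synthetic data with mostly uninformative features, and `C=0.1` is an arbitrary choice. An L1 penalty tends to drive some coefficients to exactly zero, while L2 shrinks them without eliminating them.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: 20 features, only 3 of which carry signal.
X, y = make_classification(n_samples=150, n_features=20, n_informative=3,
                           n_redundant=0, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# liblinear supports both penalties, so only the penalty differs between fits.
l1_model = LogisticRegression(penalty='l1', C=0.1, solver='liblinear').fit(X_scaled, y)
l2_model = LogisticRegression(penalty='l2', C=0.1, solver='liblinear').fit(X_scaled, y)

zeros_l1 = int(np.sum(l1_model.coef_ == 0))
zeros_l2 = int(np.sum(l2_model.coef_ == 0))

print('zero coefficients with L1:', zeros_l1)
print('zero coefficients with L2:', zeros_l2)
```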

#### Optional: Continue to incorporate other text features from the title or summary that you believe will predict the salary and examine their coefficients. Take ~100 scraped entries with salaries. Convert them to use with your model and predict the salary. Which entries have the highest predicted salaries?
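
Since the model is a classifier, one reading of "highest predicted salaries" is the entries with the highest predicted probability of being in the high class. The sketch below ranks new entries that way. The training data, the new entries, and the column names `entry` and `p_high` are all made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)

# Synthetic training data: the first feature drives the label.
X_train = rng.normal(size=(60, 3))
y_train = (X_train[:, 0] + 0.5 * rng.normal(size=60) > 0).astype(int)
model = LogisticRegression().fit(X_train, y_train)

# Synthetic "new" entries to score.
X_new = rng.normal(size=(5, 3))

# Column 1 of predict_proba is the probability of the positive (high) class.
ranked = pd.DataFrame({
    'entry': range(5),
    'p_high': model.predict_proba(X_new)[:, 1],
}).sort_values('p_high', ascending=False)

print(ranked)
```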

# Bonus Section: Use `CountVectorizer` from scikit-learn to create features from the text summaries.
- Examine using count or binary features in the model
- Re-evaluate the logistic regression model using these. Does this improve the model performance? 
- What text features are most valuable? 

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

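vectorizer = CountVectorizer()
# assumes the job summaries are in jobs_w_salary['summary']
X = vectorizer.fit_transform(jobs_w_salary['summary'])
# the count matrix is sparse, so scale without centering
X_scaled = StandardScaler(with_mean=False).fit_transform(X)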

for metric in ['accuracy', 'precision', 'recall', 'roc_auc']:
    scores = cross_val_score(model, X_scaled, y, cv=3, scoring=metric)
    print(metric, scores.mean(), scores.std())

In [None]:
model.fit(X_scaled, y)

# on scikit-learn 1.0 and later, use vectorizer.get_feature_names_out() instead
df = pd.DataFrame({'features' : vectorizer.get_feature_names(), 'coef': model.coef_[0,:]})
df.sort_values('coef', ascending=False, inplace=True)

In [None]:
df.head()

In [None]:
df.tail()
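
The difference between count and binary features can be seen on a two-sentence example. The summaries below are made up. With `binary=True`, `CountVectorizer` records only whether a word occurs, not how often.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up summaries; 'data' appears once in the first and twice in the second.
summaries = [
    'machine learning and data mining',
    'data data visualization with Tableau',
]

count_vectorizer = CountVectorizer()
binary_vectorizer = CountVectorizer(binary=True)

counts = count_vectorizer.fit_transform(summaries).toarray()
flags = binary_vectorizer.fit_transform(summaries).toarray()

# vocabulary_ maps each token to its column index.
data_index = count_vectorizer.vocabulary_['data']
print(counts[:, data_index])
print(flags[:, data_index])
```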

In [None]:
# retest L1 and L2 regularization
from sklearn.linear_model import LogisticRegressionCV

model = LogisticRegressionCV()


Score: | /24
------|-------
Identify: Problem Statement and Hypothesis |
Acquire: Import Data using BeautifulSoup |
Parse: Clean and Organize Data |
Model: Perform Logistic Regression |
Evaluate: Logistic Regression Results |
Present: Blog Report with Findings and Recommendations |
Interactive Tableau visualizations |
Regularization |
Bonus: CountVectorizer |