# Intro to Data Science

## Stats 141B

## Lecture 1 -- 9/24/21







<img src='images/datascience.png' width=600px>

## Voice from industry

**What is a data scientist?**

“essential skills to me are things like sound probabilistic reasoning, statistics, and rational decision theory. I like candidates who can think hard about what to measure, are cognizant of issues related to sampling bias, hypothesis testing, who understand Bayesian updating, understand something about causal modeling, and have taken some undergrad economics. Also, curiosity and a desire to dive into the data is important.”

-Ted Sandler, Amazon


## Voice from industry

1. know how to be productive on linux
2. know how to do scientific programming in python and/or R, preferably both
3. know how to do data munging with unix tools or with Python/Perl
4. have some experience with databases and SQL
5. have experience coding in a statically typed language like Java/C++/Scala 
6. will have done some work with map-reduce, PIG, or Spark (nowadays, presumably PyTorch/TensorFlow)
7. have done some data visualization projects

-Ted Sandler, Amazon

## Art and science of understanding data

**Typical statistical process:**

1. Come up with a model for the data (talk to domain scientists)
2. Determine an algorithm to estimate the model parameters
3. Run the algorithm
4. Inference / confidence intervals for the parameters
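
As a toy illustration of the four steps above, the sketch below assumes an i.i.d. normal model for a handful of made-up measurements, uses the sample mean as the estimator, and reports a normal-approximation 95% confidence interval. The data and the function name `mean_confidence_interval` are illustrative assumptions, not part of the lecture.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(data, level=0.95):
    """Estimate the mean and a normal-approximation confidence interval."""
    n = len(data)
    estimate = mean(data)                      # steps 2-3: choose and run the estimator
    std_error = stdev(data) / sqrt(n)          # estimated standard error of the mean
    z = NormalDist().inv_cdf(0.5 + level / 2)  # standard normal quantile for the level
    return estimate, (estimate - z * std_error, estimate + z * std_error)

# Step 1: assume the (made-up) measurements are i.i.d. Normal(mu, sigma^2).
measurements = [4.9, 5.1, 5.0, 4.8, 5.2, 5.0, 5.1, 4.9]

# Step 4: report the estimate together with its interval.
estimate, (lower, upper) = mean_confidence_interval(measurements)
print(f"estimate = {estimate:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

For a sample this small, a t-based interval would be somewhat wider; the normal quantile is used here only to keep the sketch within the standard library.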

## Art and science of understanding data

**Increased data prevalence**

> *Old way:* design experiments to answer specific questions in a controlled environment

> *New way:* ask questions and go find/extract data, repeat


## Art and science of understanding data

**Real life data science process:**

1. Broad question or curiosity
2. Find and extract relevant data
3. Exploratory data analysis and visualization
4. Formulate specific questions
5. Use statistical/machine learning methods
6. Communicate your findings


## Purpose of this course

1. Get familiar with open source technology for data science
2. Get confident with data munging (data science by any means necessary)
3. Explore principles of data processing and communication
4. Learn collaborative/software development tools (versioning systems)
5. Witness interplay between statistics and data processing/visualization
6. Start your data science portfolio
7. Make you feel alive again, if just for a little while


## Jobs where these skills can help

Everything…

1. Statistician
2. Big data (Google / Amazon / etc.)
3. Insurance / biotech / sociology / physics / etc.
4. Startup
5. Data journalism
6. Management / marketing / sales
7. Public policy / politics


## Job Market Analysis

### by Jiewei Chen and Da Xue

### Stats 141B (winter 2017) final project

- Used the Indeed.com API
- Compared statistics jobs to chemical engineering jobs
- Extracted qualifications, salaries, etc. from job descriptions
- Project at https://celinechen0211.github.io/JobMarket/jobmarket.html
- The following is reproduced from their final project with their permission

In this part, we analyzed the __skill set__, __degree requirement__, __salary level__ and __distribution of jobs__ for different majors. <br>

To get the __skill set__ required for different majors, we did natural language processing on the posts of the jobs, and extracted the top skills that are related to different majors. Further analysis was also done by comparing those skill sets.

The __degree requirements__ were compared across different majors and different job types (internship and full-time).

The __annual salary levels__ were compared across different majors. They were extracted from the job description text using natural language processing and regular expressions. Further analysis of the skill requirements at different salary levels for the statistics major was also performed.

The __distribution of company locations__ was extracted across different majors, and geographical visualization techniques were used.

| Information      | Methodology                         |
|------------------|-------------------------------------|
|Skill set         | Natural language processing / Graphs|
|Degree requirement| Natural language processing / Graphs|
|Salary            | Regular expression / Graphs         |
|Job market demand | Geographical data analysis / Data visualization|
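
The regular-expression step for salaries, for instance, might look like the sketch below. The pattern, the function name `extract_salaries`, and the sample posting are assumptions made for illustration; the original project's expressions are not shown here.

```python
import re

# Hypothetical pattern: dollar amounts such as "$85,000", "$85000", or "$70k".
SALARY_PATTERN = re.compile(r"\$\s?(\d{1,3}(?:,\d{3})+|\d+)(k\b)?", re.IGNORECASE)

def extract_salaries(text):
    """Return all dollar amounts found in the text as integers."""
    amounts = []
    for digits, k_suffix in SALARY_PATTERN.findall(text):
        value = int(digits.replace(",", ""))
        if k_suffix:           # "70k" means 70 thousand
            value *= 1000
        amounts.append(value)
    return amounts

# Made-up posting text, used only to exercise the pattern.
posting = "Annual salary: $85,000 - $100,000 depending on experience."
print(extract_salaries(posting))
```

A pattern like this only catches salaries written with a dollar sign; hourly rates or ranges written in words would need additional rules.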

In [1]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
from collections import Counter

import string
import nltk
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer   # scikit-learn: the primary machine learning package
from sklearn.neighbors import NearestNeighbors
from matplotlib import pyplot as plt
plt.style.use('ggplot')
%matplotlib inline
from matplotlib_venn import venn3
from wordcloud import WordCloud
from mpl_toolkits.basemap import Basemap
import matplotlib.patches as mpatches
import re

In [2]:
def diction_qual(files):
    """
    Build an inverted index from lemmata to document ids.

    Input: a list of job description texts.
    Output: a dictionary with lemmata as keys and, as values, lists of the
    ids of the documents that contain each lemma.

    Relies on `lemmatize` and `stop`, which are not shown in this excerpt.
    """
    textd = {}
    for i, t in enumerate(files):
        # unique lemmata in this text, minus stop words and punctuation
        s = set(lemmatize(t)) - stop - set(string.punctuation)
        for tok in s:
            # record that document i contains this lemma
            textd.setdefault(tok, []).append(i)

    return textd
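
`diction_qual` depends on a `lemmatize` function and a `stop` set that do not appear in this excerpt. The sketch below swaps in simple stand-ins (lowercased whitespace tokens and a tiny stop-word set, both assumptions) to show the shape of the resulting index on made-up postings; `simple_lemmatize` and `build_index` are hypothetical names.

```python
import string
from collections import defaultdict

# Stand-ins for the project's `lemmatize` and `stop`, which are not shown here.
STOP_WORDS = {"and", "or", "the", "with", "in", "of", "a"}

def simple_lemmatize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    tokens = (tok.strip(string.punctuation) for tok in text.lower().split())
    return [tok for tok in tokens if tok]

def build_index(documents):
    """Map each token to the list of ids of the documents that contain it."""
    index = defaultdict(list)
    for doc_id, text in enumerate(documents):
        for tok in set(simple_lemmatize(text)) - STOP_WORDS:
            index[tok].append(doc_id)
    return dict(index)

# Made-up job postings, used only to show the output structure.
postings = [
    "Experience with Python and SQL.",
    "Strong SQL skills; knowledge of R.",
    "Python or R programming.",
]
index = build_index(postings)
print(index["sql"])     # ids of the postings that mention SQL
print(index["python"])  # ids of the postings that mention Python
```

An index in this form makes it cheap to ask which postings mention a given skill, or how many do.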

![](images/market_skills.png)

Since the two of us are very familiar with our own majors, in our next step we dug only into the unique skill set for the statistics major. After some research, we generated our own keywords for the statistics major to see which skills are the most important and basic ones we need to learn.

This is a bar plot showing the proportion of jobs requiring each technique. *“excel”, “sql”, “python”, and “r” are the top four skills, required by a large share of companies. In later sections, we also compared the skill sets of different salary levels.*
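
The bar heights in a plot like this are the share of postings that mention each keyword. A minimal sketch of that computation follows; the postings are made up and `skill_shares` is a hypothetical helper, not the project's code.

```python
import re
from collections import Counter

def skill_shares(postings, skills):
    """Return the fraction of postings that mention each skill at least once."""
    counts = Counter()
    for text in postings:
        # whole-word tokens, so "r" does not match inside "required"
        tokens = set(re.findall(r"[a-z+#]+", text.lower()))
        counts.update(skill for skill in skills if skill in tokens)
    return {skill: counts[skill] / len(postings) for skill in skills}

# Made-up postings and keyword list, for illustration only.
postings = [
    "Excel and SQL required",
    "Python, R, and SQL",
    "Excel reporting",
    "R or Python; SQL a plus",
]
shares = skill_shares(postings, ["excel", "sql", "python", "r"])
for skill, share in shares.items():
    print(f"{skill}: {share:.0%}")
```

Matching on whole tokens matters for one-letter skills such as "r", which would otherwise be found inside almost every posting.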

![](images/income_skills.png)

So the next question is how can you earn more? Is there any particular skill that can boost your salary level?
<p>
Unfortunately, there is no quick answer to this question. We divided the job posts evenly into groups according to the annual salary they offer. The high-salary jobs are indicated by the red bars, and the jobs in the lowest salary group are represented by the lightest color. The most popular skills, including r, python, and hadoop (basically the content of the 141abc course series), are all commonly required regardless of salary level. Some other skills are required by some jobs but not by others. </p>
<p>
We believe there should be some correlation between skill set and salary level, but we did not find any particular skill that is specific to the high-salary group. One possible reason is limited data: not all job posts specify a salary, and only about 10% of the job posts we scraped provided salary information, so the sample size is relatively small to begin with. Another hypothesis is that it may be the size of your skill set, rather than any one special skill, that determines your salary level. </p>
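
One way to split postings into salary groups of near-equal size is by rank, as sketched below. This assumes that "equal" groups means equal counts (an interpretation, since the project text does not spell it out); the salaries are made up and `salary_groups` is a hypothetical helper.

```python
def salary_groups(salaries, n_groups=4):
    """Assign each salary a group id 0..n_groups-1 so groups are near equal in size."""
    # indices of the salaries, ordered from lowest to highest salary
    order = sorted(range(len(salaries)), key=lambda i: salaries[i])
    groups = [0] * len(salaries)
    for rank, i in enumerate(order):
        groups[i] = rank * n_groups // len(salaries)
    return groups

# Made-up annual salaries, for illustration only.
salaries = [52000, 120000, 75000, 98000, 61000, 150000, 83000, 110000]
groups = salary_groups(salaries)
print(groups)
```

With pandas, `pd.qcut` offers a similar quantile-based split; the rank-based version is shown here only to keep the sketch dependency-free.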

![](images/violin_salary.png)



## Job Posting Analysis: Data Scientist vs. Software Engineer on CyberCoders

### by Yingxi Yu, Xinyi Hou, Shengjie Shi, Hongyu Guo

### Stats 141B (winter 2017) final project

- Web-scraped cybercoders.com
- Compared data scientist and software engineer job postings
- Project at https://madscientistkris.github.io/projects/cybercoders/Project_edited_version/
- The following is reproduced from their final project with their permission

<p>&emsp;&emsp; "<a href="https://www.cybercoders.com/">CyberCoders</a> is one of the innovative employment search websites in the state. The CyberCoders website is really clear and well formatted. Since their posts have no outside links, unlike other employment search websites, it is easier to get the content of each post to construct a data frame. Also, this website focuses more on the IT-related job market, so it is perfect for us to analyze. Additionally, this website is well organized and frequently updated: we found that most of the jobs were posted within 10 days.
</p>

<p> &emsp;&emsp;In our project, we get the information of 109 Data Scientist and 200 Software Engineer job postings on CyberCoders through web scraping, which includes the job title, id, description, post date, salary range, preferred skills, city, and state. We compare the salaries of DS and SDE, including a comparison among different parts of the US. What is more, we find the required years of experience through regular expressions and the most important skills through NLP techniques. The degree required for the job and the posting dates are also topics we are interested in."
</p>
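
Pulling required years of experience out of a description with a regular expression might look like the following sketch. The pattern, the name `extract_years`, and the sample sentence are illustrative assumptions rather than the project's actual code.

```python
import re

# Hypothetical pattern for phrases like "3+ years", "2-4 years", or "5 yrs".
YEARS_PATTERN = re.compile(
    r"(\d+)\s*(?:\+|-\s*\d+)?\s*(?:years?|yrs?)\b", re.IGNORECASE
)

def extract_years(text):
    """Return the leading number of each 'N years' phrase, in order of appearance."""
    return [int(n) for n in YEARS_PATTERN.findall(text)]

# Made-up description text, used only to exercise the pattern.
description = "Requires 3+ years of Python and 2-4 years of SQL experience."
print(extract_years(description))
```

For a range such as "2-4 years" the sketch keeps only the lower bound; whether to keep the minimum, the maximum, or both is a design choice that depends on the question being asked.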

![](images/cyber_DF.png)

![](images/salary_dist.png)

The salary comparison between Data Scientist and Software Engineer postings shows typically higher salaries for DS (far higher than for the Statistics jobs in the previous analysis). We also see bimodality in the DS distribution.

![](images/east-west-salary.png)

The bimodality can be explained by the differences in salaries between east and west.

![](images/cyber_skills.png)

These skills are more technical than the indeed.com Statistics skills.  They are also more reflective of the course material in 141B.

![](images/skills_word_cloud.png)

## IPython

- install with anaconda or with package manager
- python shell with lots of added features
- tab completion
- magic commands - start with %
- shell commands - start with !
- jupyter notebook

## Text editor

- opens unformatted text files
- syntax highlighting
- popular choices: Vim, Emacs, Atom, Sublime Text

## IPython - text editor workflow

- run simple commands in ipython
- ``%save`` to temp file, move to python file and clean up
- run longer version (on full data) or save for reproducibility
- run in ipython with ``%run``
- run in terminal with ``python scriptname.py``
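
A cleaned-up script in this workflow might look like the sketch below. The function `word_counts`, the sample text, and the file name `scriptname.py` are placeholders. The `if __name__ == "__main__":` guard lets the same file be run with `%run scriptname.py` in IPython or `python scriptname.py` in a terminal, and also imported without side effects.

```python
"""scriptname.py: print the most common words in a sample text."""
from collections import Counter

# Placeholder input; a real script would load the full data set instead.
SAMPLE_TEXT = "the quick brown fox jumps over the lazy dog the end"

def word_counts(text):
    """Return a Counter of lowercase whitespace-separated words."""
    return Counter(text.lower().split())

def main():
    # print the three most frequent words with their counts
    for word, count in word_counts(SAMPLE_TEXT).most_common(3):
        print(word, count)

if __name__ == "__main__":
    main()
```

After `%run scriptname.py`, the names defined in the script (here `word_counts` and `SAMPLE_TEXT`) are available in the IPython session for further interactive work.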

## Jupyter notebook

- ``jupyter notebook`` from terminal
- runs ipython in background
- notebook consists of
  - markdown cells
  - code cells
- code cells act like ipython prompt: tab completion, magic commands, etc.
- markdown is a markup language that makes formatted text
  - e.g. `#` makes header text, `$\alpha$` makes LaTeX equations, as in $\alpha$
  - see [markdown cheatsheet](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet)




Lecture given by A. Farris

©James Sharpnack