# **Introduction to Python: The Why, The Tools, and The Building Blocks**


## **Why use a programming language like Python for data analysis?**



There are many advantages to using a programming language like Python for data analysis. We'll dive into those reasons throughout this class, but to start answering this question, I'll use spreadsheet software (e.g. Excel) as a baseline comparison, since most people who have worked with data in school or at work are familiar with gathering, processing, analyzing, and presenting data in popular spreadsheet software.

Spreadsheets have some key advantages for data analysis, but also some key drawbacks when compared to Python. Let's look at common tasks in data analysis and see what we can and can't do in spreadsheets and in a programming language like Python.

### **Spreadsheets vs. Programming**

What can a spreadsheet do that a programming language cannot? And vice versa?

|  Task  |  Spreadsheet | Programming Language |
|---|---| --- |
| Installation | **Easier** | **Harder** |
| Learning how to use | **Easier** | **Harder** |
| Importing/gathering data from a wide variety of formats | **Limited** | **Yes**|
| Organizing/processing data  | Yes | Yes **(including large datasets)** |
| Visualization | Yes | Yes |
| Repeated and automated analysis | **No/Limited** | **Yes**  |
| Advanced analytics  (e.g. prediction, statistical analysis) | **Limited** | **Yes** | 
| Sharing analyses | Yes *(possible advantage when sharing with non-technical users)* | Yes | 
| Integrating work into software applications | **Limited** | **Yes** |
| Paying for it | **Not Free (usually)** | **Free** |
| Getting support | **Paid-options Available** | **Open-Source Community** |

### **A note about Python vs R for data analysis**

You may have heard of both [R](https://www.r-project.org/) and Python when doing your research on programming languages for data analysis, as they are the most popular languages used by data scientists and analysts who perform their work in scripting languages.

Here is a brief comparison between the two programming languages:

| Area of Comparison | Python | R |
|---|---| --- |
| Data gathering | Yes | Yes |
| Data processing | Yes | Yes |
| Data analysis | Yes | Yes |
| Data visualization | Yes | Yes |
| General programming (robotics, software development, etc.) | **Yes** | **Not really** |
| Cost | Free | Free |

This is a gross simplification of the comparison between the two languages, and the degree to which one is better than the other is subject to debate and often a matter of personal preference.

**Regardless:** *Programming is a mindset and a thought process. Once you know one language, you can easily pick up the next one.* Think of this class as the beginning of a journey in learning how to _script_, not how to program in Python.

## **Tools**



### **In this course**

In our course, we'll be using [Google Colab notebooks](https://colab.research.google.com/notebooks/intro.ipynb) to quickly focus on the basics of Python for HR Analytics and Reporting (instead of spending time together making sure you have installed Python properly on your computer). 

Google Colab is a cloud service that runs Python 3 **notebooks** and gives you access to many popular Python libraries (code banks that let you do lots of cool and interesting things) and apps.

### **After this course (Notebooks vs. Scripts)**

Running analyses in notebooks (like in Google Colab) may be all you ever need in your Python journey. However, you should know that notebooks are only one of the two major ways of running code in Python:

![](https://1.bp.blogspot.com/-srqLioiiMPY/Wm6_08uBo4I/AAAAAAAAEZY/uLNCMW0xa18sVvR0ArykpToA79yNQ958QCLcBGAs/s640/Capture.JPG)

**Source:** Kaggle

In short, **Notebooks** are excellent vehicles for exploratory analysis and sharing results with non-technical audiences, whereas **scripts** enable reproducibility and integration into software applications. 




<br>
<br>

#### **Installing Python on your own computer (aka because you probably can't use Google Colab at the office)**

Generally, the [Anaconda Distribution](http://www.anaconda.org) of Python is the most popular for data science and analysis. It includes a notebook engine (Jupyter) and an integrated development environment (IDE) for scripts called Spyder.

However, there are other distributions of Python. Installing Python properly on your local machine can sometimes be a daunting task (especially if you are installing it on your company's laptop); the [Hitchhiker's Guide to Python](https://docs.python-guide.org) has clear instructions for different operating systems.


### **When you are stuck (EVERYONE gets stuck)**

I found this tweet recently, and I think its message is important for beginners to realize. **Even the most experienced data scientists and analysts out there need to Google because they don't have everything memorized!**

![](https://drive.google.com/uc?id=1kAPHx1eTGhVgH27sqhD4Ew2uyc012F8l)

* First and foremost, you should use `help()` in the Python console to better understand how to use a function you want to employ in your code.

*  [Stack Overflow](http://www.stackoverflow.com) will be among search results for any query that has "python" plus any other term you insert into the query. This is a community of data scientists, statistical programmers, and developers who answer and upvote answers to common (and not so common) questions.

* [Kaggle](http://www.kaggle.com) also has tutorials, dummy datasets, and forums for you to follow or ask questions. 

* Finally, ***Excel*** or ***Google Sheets*** is a very useful tool to help you visualize and check what you are trying to achieve with Python code. Many beginners will perform their work in Python side-by-side with Excel or Google Sheets to make sure each step of their code matches what they expect to see in a spreadsheet; eventually, you will be comfortable enough to move away from the spreadsheet software :).
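
As a quick illustration of the `help()` tip in the list above, here is a minimal sketch that looks up the documentation for a built-in string method. The choice of `str.upper` is only an example; any function or method can be passed to `help()` in the same way.

```python
# Print the built-in documentation for the str.upper method.
help(str.upper)

# The same description is stored on the object itself as its docstring.
doc = str.upper.__doc__
print(doc)
```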

## **Building Blocks of Python for Data Analysis**


### **Built-in Python Data Types**

|  | Python Data Types | Example | Add x  | Delete x |
|:---|:---|:---|:---|:---|
| 0 | Boolean | `True`, `False` | NA | NA | 
| 1 | Numeric | `12`, `19.5` | `value + x` | `value - x` |
| 2 | Strings | `"People Analytics"`, `'Human Resources 101'` | `value + x` | NA |
| 3 | Lists |  `[1, 4, 7, 10]`, `["People", "Analytics", "is", "Awesome!"]`| `listname.append(x)` | `listname.remove(x)` | 
| 4 | Dictionaries | `{"Elon Musk": "Renewable Energy", "Steve Jobs": "Computers"}`| `dictionaryname["Bill Gates"] = "Computers"` | `del dictionaryname["Bill Gates"]` |

**Note:** There are other data types including tuples and sets, but we will not cover them in this introductory course. 
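
If you are ever unsure which data type a value has, the built-in `type()` function will tell you. The sketch below reuses values from the table above; the variable names `examples` and `type_names` are just illustrative.

```python
# Each value below matches one row of the data types table.
examples = [True, 12, 19.5, "People Analytics", [1, 4, 7, 10], {"Elon Musk": "Renewable Energy"}]

# Collect the name of each value's type.
type_names = [type(value).__name__ for value in examples]
print(type_names)  # ['bool', 'int', 'float', 'str', 'list', 'dict']
```

Note that Python splits the "Numeric" row into two types: `int` for whole numbers and `float` for numbers with a decimal point.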

In [None]:
# Numeric Data Example 
x = 21
y = 7
x + y

28

In [None]:
#  String Example
x = "People Analytics "
y = "is Amazing"
x + y

'People Analytics is Amazing'

In [None]:
# List Example
x = ["Human", "Resources", "101"]
x

['Human', 'Resources', '101']

In [None]:
# Dictionary Example
all_about_bennet= {'Name':'Bennet',
                   'Country':'United States',
                   'State': 'New Jersey',
                   'Favorite Food':"Joe's Pizza on 6th Ave",
                   'Favorite form of exercise':'Peloton'}

all_about_bennet['Country']

'United States'

### **Basics of Manipulating Built-In Data Types**

#### **Slicing** 



Lists and strings can be decomposed into their constituents through **slicing**.

```
x = ["Human", "Resources", "101"]
Element 0          1         2   
```

A slice is specified through square brackets. 

***NOTE!!!*** Notice how the first element of a list begins with `0` in Python, and not the number `1`. Remember this now to save potential heartbreak later.

If we wanted to get the element `"101"` from x above, we would call `x[2]`.

Here's an example with a string object
```
x = "Bennet"
     012345
```
So to get the first letter of x, we call `x[0]`.

**Question:** How do we get the last letter of x?

In [None]:
x = "Bennet"
x[-1]

't'

We can also get more than one element from a string or list. Let's get the word "People" out of "People Analytics". Since "People" has 6 letters in it, we need everything from the first letter (index `0`) through the sixth (index `5`). Python treats the end of a slice as exclusive, so instead of writing a `5` we write a `6` below.

In [None]:
x = "People Analytics"
x[0:6]

'People'
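
Slicing works the same way on lists. The following is a small sketch using one of the example lists from the data types table; the variable names are only for illustration.

```python
words = ["People", "Analytics", "is", "Awesome!"]

first_two = words[0:2]    # start is inclusive, stop is exclusive
from_second = words[1:]   # omit the stop to go to the end
last_word = words[-1]     # negative indices count from the end

print(first_two)    # ['People', 'Analytics']
print(from_second)  # ['Analytics', 'is', 'Awesome!']
print(last_word)    # Awesome!
```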

#### **Adding elements to lists and dictionaries**

##### **Lists**

We'll use the `.append()` method (a method is the Python term for a function that can be applied directly to that kind of object; we'll go deeper on this later in the course) to add items to our lists.

In [None]:
states_with_best_pizza = ['New York','New Jersey']

states_with_best_pizza.append('Illinois')

states_with_best_pizza

['New York', 'New Jersey', 'Illinois']

##### **Dictionaries**

To add a new element to a dictionary, we specify the name of the new key in square brackets, and then set it equal to what we want its value to be.

In [None]:
all_about_bennet['Least Favorite Smell'] = 'Midtown Manhattan subway platforms in 100 degree weather'

all_about_bennet

{'Country': 'United States',
 'Favorite Food': "Joe's Pizza on 6th Ave",
 'Favorite form of exercise': 'Peloton',
 'Least Favorite Smell': 'Midtown Manhattan subway platforms in 100 degree weather',
 'Name': 'Bennet',
 'State': 'New Jersey'}

#### **Removing elements from lists and dictionaries**

##### **Lists**

We'll use the `.remove()` method to take items we don't want out of lists.

In [None]:
x = ["Human", "Resources", "101"]
x.remove("Human") 
x

['Resources', '101']
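
Two details of `.remove()` are worth knowing: it deletes only the first matching element, and it raises a `ValueError` if the value is not in the list. The sketch below uses a made-up `departments` list to show both.

```python
departments = ["HR", "Finance", "HR", "Sales"]

# .remove() deletes only the first matching element.
departments.remove("HR")
print(departments)  # ['Finance', 'HR', 'Sales']

# Removing a value that is not in the list raises a ValueError,
# so check with `in` first when you are not sure it is there.
if "Legal" in departments:
    departments.remove("Legal")
else:
    print("'Legal' is not in the list")
```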

##### **Dictionaries**

We'll use the `del` statement to remove an element from a dictionary.

In [None]:
del all_about_bennet['Least Favorite Smell']

all_about_bennet

{'Country': 'United States',
 'Favorite Food': "Joe's Pizza on 6th Ave",
 'Favorite form of exercise': 'Peloton',
 'Name': 'Bennet',
 'State': 'New Jersey'}
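
Deleting a key that does not exist raises a `KeyError`. A minimal sketch of two ways to guard against that, using a small hypothetical `profile` dictionary rather than the one above:

```python
profile = {"Name": "Bennet", "State": "New Jersey"}  # hypothetical small dictionary

# Deleting a missing key with `del` raises a KeyError, so check first.
if "Country" in profile:
    del profile["Country"]

# .get() returns a default instead of raising an error for a missing key.
country = profile.get("Country", "unknown")
print(country)  # unknown
print(profile)  # {'Name': 'Bennet', 'State': 'New Jersey'}
```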

### **Basics of programming for data wrangling**

**Comparisons** are central to data wrangling.
This is when we compare two (or more) variables to determine if a condition is true.

There are three key features:
- Two variables being compared.
- How the variables are being compared, which is known as an **operator**.
- The output is a Boolean, `True` or `False`.

| Operator | Meaning |
|:---|:---|
| < | strictly less than |
| <= | less than or equal |
| > | strictly greater than |
| >= | greater than or equal |
| == | equal |
| != | not equal |
| is | object identity |
| is not | negated object identity |

Further information on operators is [detailed here](https://docs.python.org/3/library/stdtypes.html#comparisons).



In [None]:
10 > 1

True

In [None]:
10 == 1

False

In [None]:
10 is 1

False

In [None]:
'HR' is 'HR'

True

In [None]:
'HR' == 'HR'

True

In [None]:
x = 10
y = 10
x == y

True

In [None]:
x = 'HR'
y = "HR"
x == y

True

In [None]:
'Human Resources' > 'People Analytics'

False
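
The difference between `==` (equal values) and `is` (the very same object) is easiest to see with lists. The sketch below uses made-up variables; as a side note, recent Python versions (3.8 and later) show a `SyntaxWarning` when `is` is used with literals such as `10 is 1`, because `==` is almost always what is intended there.

```python
a = [1, 2, 3]
b = [1, 2, 3]
c = a

same_value = (a == b)    # True: the lists hold equal elements
same_object = (a is b)   # False: they are two separate list objects
alias = (a is c)         # True: c is another name for the same list

# Strings are compared character by character, so "H" < "P" decides this one.
alphabetical = "Human Resources" < "People Analytics"

print(same_value, same_object, alias, alphabetical)  # True False True True
```

As a rule of thumb, use `==` when comparing data and reserve `is` for checks such as `x is None`.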

#### **If/Else conditions**

Comparisons are powerful because we can use them in our code to identify certain types of data. For example, in a table of basketball players, we could identify players who are older than 29, or who play for the Boston Celtics.

**Conditional** statements allow us to do different things depending on the result of a comparison or Boolean variable, which we refer to as a condition. The logic looks like this:
- if a condition is true, do something.
- if a condition is false, do a different thing (or nothing!).


In [None]:
avery_age = 19
cutoff_age = 29
if avery_age > cutoff_age:
  print("Avery meets the cutoff")

In this example we use two variables, `avery_age` and `cutoff_age`, which are used as part of the if statement to test whether `avery_age` is greater than `cutoff_age`. 

**Question:** Why is there no output from this operation?

`elif` stands for `else if` and enables us to test for multiple conditions. 

In [None]:
avery_age = 21

if avery_age > cutoff_age:
  print("Avery meets the cutoff")
elif avery_age > 21:
  print("Avery is older than 21 but does not meet the cutoff")
else:
  print("Avery is 21 or younger")

Avery is 21 or younger
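
Conditions can also be combined with `and` and `or`, which is how a check like "older than 29, or plays for the Boston Celtics" could be written. This is a minimal sketch with hypothetical values for a single player.

```python
# Hypothetical values for a single player, for illustration only.
player_age = 25
player_team = "Boston Celtics"
cutoff_age = 29

# `or` is True when at least one side is True; `and` needs both sides to be True.
older_or_celtic = player_age > cutoff_age or player_team == "Boston Celtics"
older_and_celtic = player_age > cutoff_age and player_team == "Boston Celtics"

if older_and_celtic:
    print("Older than the cutoff and plays for Boston")
elif older_or_celtic:
    print("Meets one of the two conditions")
else:
    print("Meets neither condition")
```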


#### **For Loops**

A **for loop** is used to iterate over a sequence. 

The sequence can be a string, a list, a dictionary, or another iterable data type.


In [None]:
name_list = ["Avery", "John", "Jonas", "Jordan"]
for x in name_list:
  print(x)

Avery
John
Jonas
Jordan


In [None]:
name = "AverY"
for something in name:
  print(something)

A
v
e
r
Y


In [None]:
name_dictionary = {"Avery": "Bradley", "John": "Holland", "Jonas": "Jerebko","Jordan": "Mickey",'Bennet':'Voorhees'}
for x,y in name_dictionary.items():
  print(x, y)

Avery Bradley
John Holland
Jonas Jerebko
Jordan Mickey
Bennet Voorhees


#### **Putting for loops and conditionals together**



In [None]:
for key, value in name_dictionary.items():
  if (key == "John" and value=='Holland'):
    print("We found a John Holland:", key, value)
  elif key == "Bennet":
    print("Bennet's been found!", key, value)
  elif key == "Stela":
    print("We found Stela!", key, value)
  

We found a John Holland: John Holland
Bennet's been found! Bennet Voorhees


### **Libraries**

Libraries are bundled modules of code, documentation, and other resources that help you achieve tasks quickly in Python.


![](https://pydsc.files.wordpress.com/2017/11/pythonenvironment.png)

**Source:** Continuum Analytics (makers of [Anaconda](http://www.anaconda.org))

In this course, we'll call up a variety of these libraries to complete certain tasks: 

|  Library  |  Purpose | 
|---|---| 
| NumPy | Working with data in arrays |
| pandas | Working with data in table format |
| scikit-learn | Machine learning and statistical analysis of data |
| StatsModels | Statistical analysis of data  |
| NLTK | Natural language processing |
| matplotlib | Data Visualization |
| NetworkX | Analysis of networks |  

However, you will find that depending on the task, you may want to use a different library that's listed in the graphic above, or use a library you find via the internet. 

#### **Importing Libraries to work with them**
For example, to import Pandas (or to import it under the shorter alias `pd`):

`import pandas` or `import pandas as pd`
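
The same two import styles work for any library. Here is a small sketch using two modules that ship with Python itself (`math` and `statistics`), so it runs without installing anything; the alias `stats` is an arbitrary choice for illustration.

```python
import math                 # use as math.sqrt(...)
import statistics as stats  # use the shorter alias: stats.mean(...)

root = math.sqrt(49)
average = stats.mean([21, 25, 29])

print(root)  # 7.0
print(average)
```

After `import pandas as pd`, Pandas functions are called the same way, with the `pd.` prefix.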


### **Pandas**

Pandas (PANel DAta analysiS) is an incredibly useful library (a reusable bank of code) that allows for manipulation and analysis of data in Python. 

Pandas allows us to take the kinds of data we reviewed above and put them into **row and column** format, like we are used to with spreadsheets :). More about this in the ***Series*** section below.




#### **The Dataframe**

The primary data structure in Pandas is a [**DataFrame**](https://pandas.pydata.org/pandas-docs/stable/getting_started/dsintro.html#dataframe), which has two key features:

- It is a two-dimensional object which has labelled axes (columns and rows).
- It is mutable, which means you can modify it, which is handy in data wrangling.  

![](https://www.geeksforgeeks.org/wp-content/uploads/creating_dataframe1.png)

Importantly, you can reference specific cells by using indices and column names. 

Let's assume
- the name of this dataframe is `my_df`; and,
- we want to know Jordan Mickey's age.

This information is in the `Age` column at index `3`.

In the world of Pandas, this is `my_df.at[3, "Age"]`.
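
To make the cell lookup concrete, here is a minimal sketch that builds a small DataFrame and reads one cell with `.at`. The ages are made-up placeholder values, not the values from the figure above.

```python
import pandas as pd

# Made-up ages for illustration; they are not the values from the figure.
my_df = pd.DataFrame({
    "Name": ["Avery Bradley", "John Holland", "Jonas Jerebko", "Jordan Mickey"],
    "Age": [25, 27, 29, 21],
})

# .at[row_label, column_name] returns a single cell.
jordan_age = my_df.at[3, "Age"]
print(jordan_age)  # 21
```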



#### **Series**

In the example above, each column is made up of a single data type. For instance, the `Name`, `Team`, `Position` and `College` columns contain text data while `Number`, `Age`, `Weight` and `Salary` contain numeric data. 

Each column is a [Series](https://pandas.pydata.org/pandas-docs/stable/getting_started/dsintro.html#series), which is the atomic data type that makes up a DataFrame. These are one-dimensional arrays that have axis labels.

Remember the built-in data types we reviewed above? Here is how they are classified in Pandas.

|  | Python Data Types | Pandas Equivalent |
|:---|:---| :---|
| 0 | Boolean |  `bool` |
| 1 | Numeric | `int64`, `float64` |
| 2 | Strings | `object` |
| 3 | Lists |  NA |
| 4 | Dictionaries |  NA |

In the example above, the column `Name` (like all the other columns) is a **Series**. 

We can extract this Series by calling `my_df['Name']`, which would yield:

```
0     Avery Bradley
1     John Holland
2     Jonas Jerebko
3     Jordan Mickey
4     Terry Rozier
5     Jared Sullinger
6     Evan Turner
Name: Name, Length: 389, dtype: object
```
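
The sketch below shows how selecting a column returns a Series and how to check a column's Pandas data type with `.dtype`. It assumes a tiny made-up DataFrame, so the values are illustrative only.

```python
import pandas as pd

# Made-up values for illustration only.
my_df = pd.DataFrame({
    "Name": ["Avery Bradley", "John Holland"],
    "Age": [25, 27],
    "Salary": [1000000.0, 1200000.0],
})

names = my_df["Name"]         # selecting one column returns a Series
print(type(names).__name__)   # Series
print(my_df["Age"].dtype)     # int64
print(my_df["Salary"].dtype)  # float64
```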

#### **Vectorized Operations**

Pandas is an excellent data science tool because you can efficiently manipulate a lot of data by applying functions to specific columns. 

An example of a vectorized operation is dividing `Salary` by `Age` to derive an age-adjusted salary.

**Question:** How would we code this?
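
One possible answer, sketched with a small made-up DataFrame: dividing one column by another applies the division to every row at once. The salaries, ages, and the new column name `Salary_per_Age` are assumptions for illustration.

```python
import pandas as pd

# Made-up salaries and ages for illustration only.
my_df = pd.DataFrame({
    "Name": ["Avery Bradley", "John Holland"],
    "Age": [25, 20],
    "Salary": [1000000, 500000],
})

# Dividing one column by another works row by row, with no explicit loop.
my_df["Salary_per_Age"] = my_df["Salary"] / my_df["Age"]
print(my_df["Salary_per_Age"].tolist())  # [40000.0, 25000.0]
```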