<a href="https://colab.research.google.com/github/czymaraclass/intros/blob/main/Intro_to_Python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Intro to Python

This intro to Python is a translation of the R intro provided [here](https://colab.research.google.com/github/czymara/czymara.github.io/blob/master/_teaching/Intro_to_R.ipynb). Thus, it is tailored for the transition from R to Python.

## Math operators

- Addition `+`
- Subtraction `-`
- Multiplication `*`
- Division `/`
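- Exponentiation `**` (note: `^`, which R uses for powers, is bitwise XOR in Python)
- Exponential `math.exp()` (requires `import math`; with NumPy, `np.exp()`)
- Logarithm `math.log()` (requires `import math`; with NumPy, `np.log()`)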
- And everything else...

In [None]:
# Example

3 + 2

# Operators can also be combined

(3 + 5) / (4 * 2)

# But we don't need Python for that…

1.0
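A short sketch of the operators that differ from R may help here. It assumes only the standard `math` module; the variable names are made up for illustration.

```python
import math

# Exponentiation uses ** in Python; ^ is bitwise XOR, not a power operator
power = 2 ** 3   # 8
xor = 2 ^ 3      # 1, because 0b10 XOR 0b11 is 0b01

# exp() and log() are not built in; they live in the math module
e_squared = math.exp(2)
natural_log = math.log(e_squared)  # natural logarithm, like log() in R

print(power, xor, natural_log)
```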

## Objects

Similar to R, Python allows you to store information in objects.
While the assignment operator in R is `<-`, it is `=` in Python (note that you could also use `=` in R, but it is less common).
To display the value of a variable, use the `print()` function. In a notebook, only the value of the last expression in a cell is displayed automatically.

In [None]:
result_1 = 3 + 5

print(result_1)

result_2 = 4 * 2

print(result_2)

8
8


Similar to R, these objects can be recalled:

In [None]:
result_3 = result_1 / result_2

print(result_3)

1.0
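The result above is `1.0` rather than `1` because `/` always returns a floating-point number in Python. A minimal sketch of the related operators, reusing the values from above:

```python
result_1 = 3 + 5
result_2 = 4 * 2

true_div = result_1 / result_2    # 1.0, always a float
floor_div = result_1 // result_2  # 1, integer division (like %/% in R)
remainder = result_1 % result_2   # 0, modulo (like %% in R)

print(true_div, floor_div, remainder)
```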


## Logical operators

The comparison operators (`==`, `!=`, `>`, `<`, `>=`, `<=`) work the same in R and Python:

In [None]:
result_1 == result_2

result_1 == result_3

result_1 != result_3

result_1 > result_3

True

Operators can, again, be combined. However, when comparing single values, `&` in R corresponds to `and` in Python, while `|` in R corresponds to `or`:

In [None]:
8 == 8 and 8 > 1  # Both statements must be true

8 == 8 and 8 == 1

8 == 8 or 8 == 1  # Only one statement needs to be true

8 != 8 or 8 == 1

False
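Note that `and` and `or` only work on single values. For element-wise comparisons on NumPy arrays (the closest match to R's vectorized `&` and `|`), Python uses `&`, `|` and `~`. A small sketch, with an illustrative array:

```python
import numpy as np

values = np.array([8, 8, 1])

# Parentheses are required: & and | bind more tightly than == and >
both = (values == 8) & (values > 1)     # element-wise "and"
either = (values == 8) | (values == 1)  # element-wise "or"
negated = ~(values == 8)                # element-wise "not", like ! in R

print(both, either, negated)
```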

## Libraries

R *packages* are usually called *libraries* (or also packages) in Python. To install a library, run `pip install package_name` on the command line (this is, unfortunately, more complicated in Python than in R). Libraries are installed into a specific environment (for example, a virtual environment), which you typically set up beforehand.

In contrast to R, Python is a programming language that was not primarily designed for data analysis. Thus, many functions that are already included in base R require importing libraries such as `NumPy` or `pandas`.

To import a library in Python, use `import package_name` (this is again relatively easy). Fortunately, several very common libraries, such as NumPy or pandas, come preinstalled in Colab, so you just have to import them.

In [None]:
# Import the package (done every session)
import pandas as pd

# To use pandas functions
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
print(df)

   A  B
0  1  4
1  2  5
2  3  6
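If you are unsure whether a library is available in your current environment, you can check before importing it. The helper `is_installed` below is a hypothetical name, sketched with the standard library only:

```python
import importlib.util

def is_installed(package_name):
    """Return True if the package can be found in the current environment."""
    return importlib.util.find_spec(package_name) is not None

# "json" ships with Python; the second name is deliberately made up
for name in ["json", "a_package_that_does_not_exist"]:
    print(name, is_installed(name))
```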


## Vectors

Python has no built-in vector type like R. The closest equivalent is a NumPy array, so we need to import the NumPy library.


In [None]:
import numpy as np

variable_num = np.array([8, 8, 1])

print(variable_num)

[8 8 1]


You can also store strings with Python (or rather: a list of strings):

In [None]:
variable_char = ["a", "b", "c"]

print(variable_char)

['a', 'b', 'c']


To combine the numerical objects we created before, we again use NumPy:

In [None]:
variable_num = np.array([result_1, result_2, result_3])

print(variable_num)

[8. 8. 1.]
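One reason to prefer NumPy arrays over plain lists is that arithmetic on arrays is element-wise, as it is for R vectors. A brief sketch of the difference:

```python
import numpy as np

as_array = np.array([8, 8, 1])
as_list = [8, 8, 1]

doubled_array = as_array * 2  # element-wise, like c(8, 8, 1) * 2 in R
repeated_list = as_list * 2   # repeats the list instead of multiplying

print(doubled_array)
print(repeated_list)
```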


## Accessing elements in the list

Similar to R, we use `[]` to recall a certain value of `variable_num`. However, while indexing in R starts at 1, Python uses 0-based indexing (like most programming languages).

In [None]:
variable_num[0]  # equivalent to variable_num[1] in R
variable_num[1]  # equivalent to variable_num[2] in R
variable_num[2]  # equivalent to variable_num[3] in R

# Side note: indices must be integers in Python. Because variable_num holds floats,
# a value has to be converted with int() before it can be used as an index:
nested_index = variable_num[int(variable_num[2])]
print(nested_index)

8.0
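Two further differences from R are worth a quick sketch: negative indices count from the end (they do not drop elements), and slices exclude their end position.

```python
import numpy as np

variable_num = np.array([8.0, 8.0, 1.0])

last = variable_num[-1]           # last element; in R, [-1] would drop the first
first_two = variable_num[0:2]     # positions 0 and 1; the end (2) is excluded
without_first = variable_num[1:]  # everything but the first, like variable_num[-1] in R

print(last, first_two, without_first)
```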


## Important types of variables in Python

- Logical: Binary variable with values `True` and `False`
- Character (string): Text (including symbols and numbers that are treated as text)
- Numeric: Numbers for mathematical operations (integers (`int`) and floating-point numbers (`float`))
- Factor: No built-in equivalent; the `pandas` library provides a categorical data type (`pd.Categorical`).
- Missing values: `NA` in R is `None` in Python, or `np.nan` with `NumPy`.


In [None]:
# Examples

# Logical
type(True)  # equivalent to class(TRUE) in R

# Character (String)
type("this is 1 character")  # equivalent to class("this is 1 character") in R

# Numeric
type(123)  # returns int; class(123) in R returns "numeric"

# Factor (Categorical)
import pandas as pd
category_example = pd.Categorical(["first gen immigrant", "second gen immigrant", "native"])
type(category_example)  # equivalent to class(factor(...)) in R
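Missing values behave a little differently than `NA` in R. The following sketch uses a made-up `ages` series to show how to detect them and how pandas treats them when computing a mean:

```python
import numpy as np
import pandas as pd

ages = pd.Series([25, np.nan, 40])  # illustrative data, not from toy_data

nan_equals_itself = np.nan == np.nan  # False, so == cannot be used to find missing values
missing_mask = ages.isna()            # like is.na() in R
mean_age = ages.mean()                # pandas skips missing values by default (like na.rm = TRUE)

print(nan_equals_itself, missing_mask.tolist(), mean_age)
```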


## Data frames

The Python equivalent of an R data frame is the `DataFrame`, which is provided by the `pandas` library.

In [None]:
import pandas as pd
import numpy as np

# Generate toy variables

# IDs
id = np.arange(1, 11)

# Set seed for reproducibility
np.random.seed(1608)

# Generate income variable
income = np.round(np.random.uniform(low=42192-10000, high=42192+10000, size=10))

# Generate migrant status variable
migrant = np.where(np.round(np.random.uniform(low=0, high=1, size=10)) == 0, "immigrant", "native")

# Generate birth year variable
birthyear = np.round(np.random.uniform(low=1960, high=2000, size=10))

# Combine into a data frame
toy_data = pd.DataFrame({
    'id': id,
    'income': income,
    'migrant': migrant,
    'birthyear': birthyear
})

# Display the data frame
print(toy_data)

   id   income    migrant  birthyear
0   1  37954.0     native     1978.0
1   2  46874.0  immigrant     1973.0
2   3  45180.0     native     1984.0
3   4  51873.0  immigrant     1999.0
4   5  50384.0  immigrant     1967.0
5   6  49760.0  immigrant     1970.0
6   7  46691.0  immigrant     1968.0
7   8  41815.0  immigrant     1961.0
8   9  47886.0     native     1971.0
9  10  46528.0     native     1968.0



Accessing a specific column of a DataFrame (like `toy_data$migrant` in R) works as follows:

In [None]:
toy_data['migrant']

# Actually show the result:
print(toy_data['migrant'])

# Alternatively:
toy_data.migrant

0       native
1    immigrant
2       native
3    immigrant
4    immigrant
5    immigrant
6    immigrant
7    immigrant
8       native
9       native
Name: migrant, dtype: object




To tabulate the number of observations in each category (like `table()` in R), use `value_counts()`. Here we count "immigrant" and "native" in the `migrant` column of the `toy_data` DataFrame:

In [None]:
toy_data['migrant'].value_counts()

migrant
immigrant    6
native       4
Name: count, dtype: int64
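To get shares instead of counts (similar to `prop.table(table(x))` in R), `value_counts()` accepts `normalize=True`. The sketch below rebuilds the `migrant` column from the values printed above so that it runs on its own:

```python
import pandas as pd

# Values copied from the toy_data output shown earlier
migrant = pd.Series(
    ["native", "immigrant", "native", "immigrant", "immigrant",
     "immigrant", "immigrant", "immigrant", "native", "native"],
    name="migrant",
)

shares = migrant.value_counts(normalize=True)  # proportions instead of counts
print(shares)
```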


## Recoding variables

How many observations have an income above the mean?



In [None]:
import pandas as pd
import numpy as np

# 'income' is already numeric

# Create a new variable 'inc_ab_mean' that indicates whether the income is strictly above the mean
toy_data['inc_ab_mean'] = toy_data['income'] > toy_data['income'].mean()

# Display the updated DataFrame
print(toy_data)

   id   income    migrant  birthyear  inc_ab_mean
0   1  37954.0     native     1978.0        False
1   2  46874.0  immigrant     1973.0         True
2   3  45180.0     native     1984.0        False
3   4  51873.0  immigrant     1999.0         True
4   5  50384.0  immigrant     1967.0         True
5   6  49760.0  immigrant     1970.0         True
6   7  46691.0  immigrant     1968.0         True
7   8  41815.0  immigrant     1961.0        False
8   9  47886.0     native     1971.0         True
9  10  46528.0     native     1968.0         True
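To answer the question directly, the `True` values can be summed, since `True` counts as 1. The sketch below also recodes the indicator into text labels with `np.where()` (similar to `ifelse()` in R). The income values are copied from the output above, and the label strings are made up for illustration:

```python
import numpy as np
import pandas as pd

# Income values copied from the toy_data output shown above
income = pd.Series([37954, 46874, 45180, 51873, 50384,
                    49760, 46691, 41815, 47886, 46528], dtype=float)

above_mean = income > income.mean()
n_above = int(above_mean.sum())  # True counts as 1, False as 0

# Recode into labels, similar to ifelse() in R
income_group = np.where(above_mean, "above mean", "at or below mean")

print(n_above)
print(income_group)
```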


## Indexing

In Python, using the pandas library, we can index data in a DataFrame similarly to R, but with some differences in the syntax:

- Use `iloc[row_index, column_index]` to access elements by their position
- Both rows and columns start from 0 in Python

In [None]:
# From our toy data

# All values of first "respondent"
print(toy_data.iloc[0, :])

# All values of first variable (ID)
print(toy_data.iloc[:, 0])

id                   1
income         37954.0
migrant         native
birthyear       1978.0
inc_ab_mean      False
Name: 0, dtype: object
0     1
1     2
2     3
3     4
4     5
5     6
6     7
7     8
8     9
9    10
Name: id, dtype: int64
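Besides position-based `iloc`, pandas also offers label-based `loc`. The following sketch uses a small DataFrame with a made-up text index (`"a"`, `"b"`, `"c"`) to make the difference visible:

```python
import pandas as pd

# Small illustrative DataFrame; the text index is an assumption for this example
toy = pd.DataFrame(
    {"id": [1, 2, 3], "income": [37954.0, 46874.0, 45180.0]},
    index=["a", "b", "c"],
)

by_position = toy.iloc[0, 1]        # first row, second column
by_label = toy.loc["a", "income"]   # row labelled "a", column "income"

rows_by_position = toy.iloc[0:2, :]  # positions 0 and 1; the end is excluded
rows_by_label = toy.loc["a":"b", :]  # labels "a" through "b"; the end is included

print(by_position, by_label)
```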


## Subsetting data

Subsetting the data in Python works like this:


In [None]:
# Only keeping variables migrant status and yearly income in the data set:
variables = ["migrant", "income"]

toy_data_sub_var = toy_data[variables]

print(toy_data_sub_var)

# Only keeping the first five observations (positions 0 to 4; the end of the slice is excluded)
toy_data_sub_obs = toy_data.iloc[0:5, :]

print(toy_data_sub_obs)

     migrant   income
0     native  37954.0
1  immigrant  46874.0
2     native  45180.0
3  immigrant  51873.0
4  immigrant  50384.0
5  immigrant  49760.0
6  immigrant  46691.0
7  immigrant  41815.0
8     native  47886.0
9     native  46528.0
   id   income    migrant  birthyear  inc_ab_mean
0   1  37954.0     native     1978.0        False
1   2  46874.0  immigrant     1973.0         True
2   3  45180.0     native     1984.0        False
3   4  51873.0  immigrant     1999.0         True
4   5  50384.0  immigrant     1967.0         True
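Rows can also be selected by a condition, similar to `subset()` in R. The sketch below rebuilds the first five rows shown above; the 50000 threshold is an arbitrary value chosen for illustration:

```python
import pandas as pd

# First five rows of toy_data, copied from the output above
toy = pd.DataFrame({
    "migrant": ["native", "immigrant", "native", "immigrant", "immigrant"],
    "income": [37954.0, 46874.0, 45180.0, 51873.0, 50384.0],
})

# Keep only immigrants
immigrants = toy[toy["migrant"] == "immigrant"]

# Combine conditions with & (each condition in parentheses)
high_income_immigrants = toy[(toy["migrant"] == "immigrant") & (toy["income"] > 50000)]

# The same filter written as a query string
same_with_query = toy.query("migrant == 'immigrant' and income > 50000")

print(high_income_immigrants)
```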


## Functions

To replicate the R functions used in the R intro, we can use the following code:

In [None]:
# Examples

# Mean income
mean_income = toy_data['income'].mean()
print(mean_income)

# Range of income
income_range = (toy_data['income'].min(), toy_data['income'].max())
print(income_range)

# Test for mean differences in income between immigrants and natives
from scipy import stats

# Subset the income data based on migrant status
income_immigrant = toy_data[toy_data['migrant'] == 'immigrant']['income']
income_native = toy_data[toy_data['migrant'] == 'native']['income']

# Perform Welch's t-test (equal_var=False does not assume equal variances)
t_stat, p_value = stats.ttest_ind(income_immigrant, income_native, equal_var=False)

print(f"T-statistic: {t_stat}, P-value: {p_value}")

46494.5
(37954.0, 51873.0)
T-statistic: 1.321154185835615, P-value: 0.23802261332367058
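Before testing for a difference, it is often useful to look at the group means themselves. One way to do this is `groupby()`, which is roughly comparable to `aggregate()` or `tapply()` in R. The data below is copied from the `toy_data` output shown earlier:

```python
import pandas as pd

# Values copied from the toy_data output shown earlier
toy_data = pd.DataFrame({
    "migrant": ["native", "immigrant", "native", "immigrant", "immigrant",
                "immigrant", "immigrant", "immigrant", "native", "native"],
    "income": [37954.0, 46874.0, 45180.0, 51873.0, 50384.0,
               49760.0, 46691.0, 41815.0, 47886.0, 46528.0],
})

# Mean income per group
group_means = toy_data.groupby("migrant")["income"].mean()

# Several summary statistics per group at once
summary = toy_data.groupby("migrant")["income"].agg(["mean", "min", "max", "count"])

print(group_means)
print(summary)
```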


## Tidyverse

Unfortunately, there is no single equivalent of the tidyverse available in Python (afaik). ☹
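That said, chaining pandas methods can approximate a dplyr-style pipeline. The following is only a rough sketch of that idea, not a full replacement. The column `income_k` and the threshold of 45 are made up for illustration, and the data is copied from the `toy_data` output above:

```python
import pandas as pd

# Values copied from the toy_data output shown earlier
toy_data = pd.DataFrame({
    "migrant": ["native", "immigrant", "native", "immigrant", "immigrant",
                "immigrant", "immigrant", "immigrant", "native", "native"],
    "income": [37954.0, 46874.0, 45180.0, 51873.0, 50384.0,
               49760.0, 46691.0, 41815.0, 47886.0, 46528.0],
})

result = (
    toy_data
    .assign(income_k=lambda d: d["income"] / 1000)   # like mutate()
    .query("income_k > 45")                          # like filter()
    .groupby("migrant", as_index=False)              # like group_by()
    .agg(mean_income_k=("income_k", "mean"),         # like summarise()
         n=("income_k", "size"))
    .sort_values("mean_income_k", ascending=False)   # like arrange(desc(...))
)

print(result)
```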