<a href="https://colab.research.google.com/github/andreaskuepfer/.github/blob/main/Week%201/Week_1_A_Getting_Started_(ML_for_Social_Scientists_MZES_2024).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Machine Learning for Social Scientists - MZES 2024**

**Week 1 Lab: Getting Started (2024-09-09)**

---


*Ruben Bach (Data and Methods Unit, MZES Mannheim)*

*Andreas Küpfer (Data and Methods Unit, MZES Mannheim & Institute for Political Science, TU Darmstadt)*

## First Steps in Python

Welcome to Week 1 of the lab!

Implementing your Machine Learning pipeline is still much more comfortable in Python than in R due to better package availability and seamless integration with GPU access. While for some of you this is going to be the first time working with Python, others might already have some experience. For this very first part of our workshop series you do not need a local installation of Python, as we work in the interactive browser environment Google Colab. You're currently working in a so-called "Jupyter Notebook" (recognizable by the file type *.ipynb*). These are quite similar to RMarkdown Notebooks, and we are going to use them locally on your machines later during this lab.

Let's begin with just a few basics of Python. Execute the code chunks below by either clicking the "Start" button on the left side of each chunk or pressing "CTRL+Enter" (or "CMD+Enter" for Mac users) after clicking on a chunk:

In [64]:
print('This is week', 1, 'of Machine Learning for Social Scientists (MZES 2024)')

This is week 1 of Machine Learning for Social Scientists (MZES 2024)


As you see, printing output to the screen is straightforward. You can even directly include data types other than *string* (similar to *character* in R), such as numbers.

Let's explore some basic mathematical calculations:

In [50]:
2000+24

2024

Concatenating *strings* and assigning the result to a new variable works the following way (there are also many other ways to achieve the same result):

In [51]:
variable_string = "Machine" + " " + "Learning"
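
Two of those other ways are f-strings and `str.join()`. The following is only a short sketch; the variable names `first`, `second`, `via_fstring`, and `via_join` are made up for this illustration:

```python
first = "Machine"
second = "Learning"

# f-string: expressions inside curly braces are inserted into the text
via_fstring = f"{first} {second}"

# str.join(): glue a sequence of strings together with a separator
via_join = " ".join([first, second])

print(via_fstring)
print(via_join)
```

Both variables hold the same text as the `+` version above.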

It is important to note that *strings* are a type of sequence. Other sequence types in Python are *lists* and *tuples*. We can make use of this fact by selecting parts of the *string* by index inside square brackets:

In [52]:
variable_string[1]

'a'

Shouldn't we see 'M' instead of 'a'? No, as Python indexes start at 0. Hence, if you want to address the first element of a sequence, you have to use 0 as the index:

In [53]:
variable_string[0]

'M'
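
Indexing works the same way for all sequence types. As a small sketch (the variables `text`, `numbers`, and `pair` are invented for this example), the following also shows negative indices and slices, which we have not used so far:

```python
text = "Machine Learning"
numbers = [3, 4, 5]   # list: can be changed after creation
pair = (2000, 24)     # tuple: cannot be changed after creation

first_char = text[0]    # index 0 is the first element
last_char = text[-1]    # negative indices count from the end
first_word = text[0:7]  # slice from index 0 up to (not including) index 7

print(first_char, last_char, first_word)
print(numbers[1])  # second element of the list
print(pair[0])     # first element of the tuple
```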

Similarly, we can use lists for various types of variables, e.g., integers. We can even add multiple lists. Or can we really do that? Let's try it:

In [54]:
x = [3, 4, 5]
y = [6, 7, 8]

x + y

[3, 4, 5, 6, 7, 8]

This doesn't look right. Python only concatenated our two lists instead of adding them element by element. While this seems counterintuitive, it highlights that Python is a general-purpose language applicable to much more than statistical analysis. Lists can hold different types of objects, so `+` always concatenates them, just as it did earlier in our *string* example.
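
For illustration only, here is how element-wise addition could be written in plain Python without any package. This is a sketch, not a recommendation; the names `elementwise` and `mixed` are made up:

```python
x = [3, 4, 5]
y = [6, 7, 8]

# zip() pairs up the elements; the list comprehension adds each pair
elementwise = [a + b for a, b in zip(x, y)]
print(elementwise)

# a single list can mix types, so "+" cannot mean arithmetic addition in general
mixed = [1, "two", 3.0]
print(mixed + ["four"])
```

This works, but it gets tedious quickly, especially with more than one dimension.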

This is one reason to rely on one of the roughly 500,000 packages available for Python (and, similar to R, there are tons of further reasons to do so). Probably the most important packages in Python are `numpy` and `pandas`.

Starting with `numpy` we can import all of its functions at once with the following import command:

In [55]:
import numpy as np

# # 1) Alternative to import all functions:
# from numpy import *
# # 2) Alternative to import one or multiple functions:
# from numpy import array, sin
# #
# # IMPORTANT NOTE on imports: It is recommended and common to use "import numpy
# #    as np" for numpy as this means you have to call functions using the prefix
# #    "np.". Both alternatives do not require this (i.e. you would call a function
# #    directly by its name), carrying the risk of name clashes with other packages.
# #    If you don't want to use the recommended way you should rely on Alternative
# #    2) to keep track of imported functions.
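
The name-clash risk mentioned in the comments above can be made concrete with a small sketch. Python has a built-in `sum()`, and numpy has its own `np.sum()`; they behave differently on a two-dimensional array. With a star import, an unqualified `sum` could silently refer to numpy's version. The variable names `table`, `builtin_result`, and `numpy_result` are made up for this example:

```python
import numpy as np

table = np.array([[1, 2], [3, 4]])

# built-in sum() iterates over the rows and adds them: one total per column
builtin_result = sum(table)

# np.sum() adds up every number in the array
numpy_result = np.sum(table)

print(builtin_result)
print(numpy_result)
```

With the `np.` prefix it is always visible which of the two functions is being called.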

We start using numpy by recycling our concatenation example. As we still want to add the elements of the two *lists*, we transform each of them into a numpy array, a container for numbers that can have one or more dimensions:

In [56]:
x = np.array([3, 4, 5])
y = np.array([6, 7, 8])

x + y

array([ 9, 11, 13])
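
Other arithmetic operators work element by element as well. A brief sketch (the result names are made up for this example):

```python
import numpy as np

x = np.array([3, 4, 5])
y = np.array([6, 7, 8])

product = x * y     # element-wise product
doubled = x * 2     # the scalar 2 is applied to every element
difference = y - x  # element-wise difference

print(product, doubled, difference)

# for comparison: multiplying a plain list by 2 repeats the list
repeated = [3, 4, 5] * 2
print(repeated)
```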

In Machine Learning we work with multidimensional arrays a lot (e.g., embedding representations of text, video, or audio). Defining them is very intuitive:

In [57]:
z = np.array([[1, 2], [3, 4]])
z

array([[1, 2],
       [3, 4]])

In [58]:
z.dtype

dtype('int64')

In [59]:
z.ndim

2

In [60]:
z.shape

(2, 2)

Retrieving the data type, number of dimensions, and shape of our array is straightforward. Numpy also provides an easy function to sum up all numbers of an array object:

In [61]:
z.sum()

10
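
Often you want sums per row or per column rather than one grand total. As a sketch, `sum()` accepts an `axis` argument for this (the names `column_sums` and `row_sums` are made up):

```python
import numpy as np

z = np.array([[1, 2], [3, 4]])

column_sums = z.sum(axis=0)  # collapse the rows: one sum per column
row_sums = z.sum(axis=1)     # collapse the columns: one sum per row

print(column_sums)
print(row_sums)
```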

The following example has two purposes: getting to know the *tuple* type as well as another important function of the numpy universe, `reshape()`:

In [62]:
print('beginning z:\n', z)
z_reshape = z.reshape((4, 1)) # rows, columns - at least this is familiar...
print('reshaped z:\n', z_reshape)

beginning z:
 [[1 2]
 [3 4]]
reshaped z:
 [[1]
 [2]
 [3]
 [4]]
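
Two further details, shown as a hedged sketch: a dimension of `-1` asks numpy to work out that dimension from the number of elements, and a *tuple*, unlike a *list*, cannot be modified after creation. The names `target_shape`, `row`, and `flat` are made up for this example:

```python
import numpy as np

z = np.array([[1, 2], [3, 4]])

target_shape = (1, 4)  # a tuple: 1 row, 4 columns
row = z.reshape(target_shape)
print(row)

# -1 lets numpy infer the length: here a flat array with 4 elements
flat = z.reshape(-1)
print(flat)

# tuples are immutable: assigning to an element raises a TypeError
try:
    target_shape[0] = 2
except TypeError as error:
    print("tuples cannot be modified:", error)
```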


To now access a certain element in our reshaped numpy array, we can rely on multidimensional indexing (again, keep in mind that indexes in Python start at 0!):

In [63]:
z_reshape[2, 0]

3
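
Besides single elements, you can select whole rows or columns with a colon. A minimal sketch on the original 2x2 array (the result names are made up):

```python
import numpy as np

z = np.array([[1, 2], [3, 4]])

first_row = z[0, :]       # row 0, all columns
second_column = z[:, 1]   # all rows, column 1
last_element = z[-1, -1]  # negative indices count from the end

print(first_row, second_column, last_element)
```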

This was an elementary introduction to Python's numpy. While pandas is probably **the** package to load datasets, working with data is much easier in a local environment on your machine. Other reasons for needing a local Python distribution are package versions, reproducibility, and easier access to GPUs for resource-intensive work. Unlike in R, in Python you should carefully pin package versions when installing them. Otherwise, you might end up with version clashes between several package dependencies. Beyond that, working with an isolated environment per project is very common in Python. This means you install packages in a so-called *Virtual Environment* or *Conda Environment*. Environments not only help you keep track of packages but also ensure the reproducibility of your findings on other machines.
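
Because package versions matter so much, it can help to check from within Python which version of a package is actually installed. The following is a minimal sketch using the standard library module `importlib.metadata` (available from Python 3.8); the helper name `installed_version` is made up for this illustration:

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


for name in ["numpy", "pandas"]:
    print(name, installed_version(name))
```

The printed versions depend on your environment, which is exactly the point: two machines can easily differ here.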

## Installing Python on Your Machine

This means you now want to install Python locally. The following guideline will help you succeed in setting up a working local installation.

#### 1. Download Python

1. Open your web browser and go to the official Python website: [python.org](https://www.python.org/).
2. Navigate to the Downloads section.
3. The website typically detects your operating system and suggests the appropriate version. Click the download button to get the latest stable release.

#### 2. Install Python on Your System

##### **For Windows**

1. Locate the downloaded Python installer (`python-<version>.exe`) and run it.
2. In the installer window:
   - Check the box labeled "Add Python to PATH".
   - Click on "Install Now" for a standard installation or "Customize installation" if you need specific options.
3. The installer will run and install Python. Once completed, click "Close".

##### **For macOS**

1. Locate the downloaded Python installer (`python-<version>-macosx<macosx_version>.pkg`) and run it.
2. Follow the installation instructions, which will guide you through the steps.
3. After installation, Python should be available in your Applications folder and accessible from the terminal.

##### **For Linux**

Python is often pre-installed on many Linux distributions. However, to install the latest version, follow these steps:

1. Open your terminal.
2. Update your package list:
   ```bash
   sudo apt update
   ```
3. Install Python using the package manager. For example, on Debian-based systems like Ubuntu, use:
   ```bash
   sudo apt install python3
   ```
4. Verify the installation by checking the Python version:
   ```bash
   python3 --version
   ```

#### 3. Verify Installation

After installation, ensure Python is correctly installed by verifying the version.

1. Open a command prompt (Windows) or terminal (macOS/Linux).
2. Type the following command and press Enter:
   ```bash
   python --version
   ```
   or
   ```bash
   python3 --version
   ```
3. You should see the Python version number printed on the screen.

Congratulations! You now have a working installation of Python on your local machine.
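
You can run the same check from inside Python, which also tells you which interpreter is actually being used. This is a small sketch using only the standard library; the variable name `is_python3` is made up:

```python
import platform
import sys

print("Python version:", platform.python_version())
print("Interpreter:", sys.executable)
print("Operating system:", platform.system())

# sys.version_info can be compared like a tuple
is_python3 = sys.version_info >= (3,)
print("Running Python 3:", is_python3)
```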

## Setting up a Package Manager on Your Machine

The easiest way to install packages in R is by simply calling `install.packages(<list_of_packages>)`. The R package `pacman` offers a more comprehensive way to manage packages by handling not only the installation of CRAN and GitHub packages but also loading them into R. You could see `pacman` as a light version of Python package managers. While package managers in R are more 'nice to have', in Python they are essential.


The Python universe developed two dominant package management solutions: `pip` and `conda`. First, `pip` is the package installer for Python (see *(Optional) Install pip* below to check whether you need to install `pip` or whether it's already available). It allows you to install and manage additional libraries (e.g., numpy, pandas, or torch) and dependencies that are not included in the standard library (similar to base R). Second, `conda` is introduced at the very end of this notebook.

#### (Optional) Install pip

`pip` is a package manager for Python that allows you to install additional libraries and packages.

- For Windows and macOS, `pip` is included with the Python installation.
- For Linux, you might need to install `pip` separately:
  ```bash
  sudo apt install python3-pip
  ```

Before installing your first Python package, let's discuss project environments. You could install all your packages without any isolated environment to manage dependencies. However, as mentioned earlier, this bears a great risk of version clashes between different projects, as package versions in Python (and even the Python version itself) do matter. A lot!

This is why we now set up a so-called virtual environment for this workshop. On most machines you don't need to install any additional software to do this. For the following lines of code, go back to your command prompt/terminal.

1. (Only for Linux users!) Install the `venv` module if it's not already available:
   ```bash
   sudo apt install python3-venv
   ```
2. Create a virtual environment:
   ```bash
   python -m venv css_mzes2024
   ```
3. Activate the virtual environment:
   - On Windows:
     ```bash
     css_mzes2024\Scripts\activate
     ```
   - On macOS/Linux:
     ```bash
     source css_mzes2024/bin/activate
     ```
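
To check whether the activation worked, you can start Python and ask the interpreter whether it belongs to a virtual environment. The following is a sketch based on the standard library; the function name `in_virtual_environment` is made up, and the check only covers environments created with `venv`, not `conda` environments:

```python
import sys


def in_virtual_environment():
    """Return True if the running interpreter belongs to a venv environment."""
    # inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the underlying Python installation
    return sys.prefix != sys.base_prefix


print("Inside a virtual environment:", in_virtual_environment())
print("Environment location:", sys.prefix)
```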

Now use `pip` to install additional packages within your environment as needed. For example:
```bash
pip install numpy
```
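
To reproduce an environment later, you need a record of the exact versions that ended up installed. `pip` offers `pip freeze` for this purpose. The sketch below builds a similar list from within Python using the standard library; the function name `pinned_requirements` is made up, and the output depends entirely on what is installed in your environment:

```python
from importlib.metadata import distributions


def pinned_requirements():
    """Return sorted 'name==version' lines for all installed packages."""
    lines = set()
    for dist in distributions():
        name = dist.metadata.get("Name")
        if name:  # skip entries with incomplete metadata
            lines.add(f"{name}=={dist.version}")
    return sorted(lines, key=str.lower)


for line in pinned_requirements():
    print(line)
```

Each line has the form `package==version`, the same notation you can pass to `pip install` to request a specific version.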

Use the following command to deactivate your virtual environment:

```bash
deactivate
```

Deleting it works as follows (**Be very (!) careful with this command, as it deletes the given directory including all files and subdirectories inside it. The path you pass to `rm -r` is interpreted relative to your current working directory.**):

```bash
rm -r css_mzes2024/
```
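
If you prefer a safety net, the deletion can also be done from Python with a check beforehand. The sketch below relies on the fact that `venv` places a `pyvenv.cfg` file at the top level of each environment; the function name `remove_virtual_environment` is made up for this illustration:

```python
import shutil
from pathlib import Path


def remove_virtual_environment(env_dir):
    """Delete env_dir, but only if it looks like a virtual environment."""
    env_path = Path(env_dir).resolve()
    # refuse to delete directories that lack the venv marker file
    if not (env_path / "pyvenv.cfg").is_file():
        raise ValueError(f"{env_path} does not look like a virtual environment")
    shutil.rmtree(env_path)


# Example call (commented out so nothing is deleted by accident):
# remove_virtual_environment("css_mzes2024")
```

Calling the function on any other directory raises an error instead of deleting it.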

The following lab sessions use virtual environments. However, we encourage you to try out `conda` as an easy-to-use and comprehensive package manager. Please find both an installation guideline and a demo of the most common functions at the very end of this Google Colab.

## Local version of Google Colab: Jupyter Notebooks

The merits of Google Colab are chunk-by-chunk code execution, which is useful for trying out new code and packages and for learning cutting-edge computational social science methods before converting your code to a script file to be executed on a cluster. One of its huge drawbacks is its difficult package management. Good news: You can combine virtual environments or `conda` with a local version of Google Colab called *Jupyter Notebook*.

The following guideline helps you to set up *Jupyter Notebook* and activate your local package environment. And -- we promise -- it's the last installation routine for this lab session.

#### 1. Install Jupyter Notebook (Only once)

This tool is really easy to install and you don't need any additional software. Just install the following package via pip (**if a virtual environment is currently active, deactivate it first so that Jupyter Notebook is installed globally!**):

```bash
pip install notebook
```

#### 2. Activate Your Local Package Environment

Now, reactivate your environment, install `ipykernel`, and add your environment (don't forget to replace \<your_env_name\> with your actual environment name) as a new so-called kernel:

```bash
pip install ipykernel
ipython kernel install --user --name=<your_env_name>
```

#### 3. Run Jupyter Notebook and Select Kernel

Finally, run an instance of Jupyter Notebook to create your first own Notebook and activate your kernel (named after your virtual environment):

```bash
jupyter notebook
```

To shut down your Jupyter server, press CTRL+C and type 'Y' (for Yes). Don't worry; if you saved your Notebooks, they are safely stored on your local disk and can be reaccessed when you start Jupyter again.

## (Optional) Alternative to Virtual Environments: `conda`

`conda` needs to be installed separately from Python and primarily focuses on data science packages, which are also the focus of this workshop. While this sounds like a restriction, `conda` offers a great overview of all environments and packages in its own UI. It is also easier to keep track of all your environments via the command prompt/terminal. `conda` is not even restricted to Python and lets you combine packages from different programming languages (yes, also R).

The following guideline helps you set up `conda` and introduces some of its basic functionalities if you want to try it out.

#### 1. Download the Anaconda or Miniconda installer

`conda` is included in both the Anaconda and Miniconda distributions. Anaconda is a larger distribution with many pre-installed packages, while Miniconda is a minimal installer with just `conda` and its dependencies.

##### Anaconda:

- Download the Anaconda installer from the [Anaconda Distribution page](https://www.anaconda.com/products/distribution). (You are required to provide an email to receive a download link)

##### Miniconda:

- Download the Miniconda installer from the [Miniconda page](https://docs.conda.io/en/latest/miniconda.html).

#### 2. Install Anaconda or Miniconda

##### **For Windows:**

1. Run the downloaded installer.
2. Follow the installation instructions, and make sure to check the box to add Anaconda or Miniconda to your PATH environment variable.

##### **For macOS:**

1. Open a terminal and navigate to the directory where the installer is downloaded.
2. Run the installer using:

    ```bash
    bash Miniconda3-latest-MacOSX-x86_64.sh
    ```

    or

    ```bash
    bash Anaconda3-latest-MacOSX-x86_64.sh
    ```

3. Follow the prompts to complete the installation.

##### **For Linux:**

1. Open a terminal and navigate to the directory where the installer is downloaded.
2. Run the installer using:

    ```bash
    bash Miniconda3-latest-Linux-x86_64.sh
    ```

    or

    ```bash
    bash Anaconda3-latest-Linux-x86_64.sh
    ```

3. Follow the prompts to complete the installation.

#### 3. Verify Installation

To verify the installation of `conda`, open a new terminal or command prompt and run:

```bash
conda --version
```

You should see the version of `conda` installed.

#### Creating a Conda Environment

To create a new conda environment, use the `conda create` command with the `--name` flag followed by the name of the environment and, optionally, any packages you want to install.

```bash
conda create --name css_mzes2024
```

#### Activating a Conda Environment

To activate the conda environment you created, use the `conda activate` command followed by the name of the environment.

```bash
conda activate css_mzes2024
```

Once activated, the environment name will appear in your terminal prompt, indicating that the environment is currently active. You can now install packages using one of the following commands:

```bash
conda install numpy
```

or (if a package is not available with conda)

```bash
pip install numpy
```
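
To confirm from within Python which conda environment is active, you can read the `CONDA_DEFAULT_ENV` environment variable, which `conda activate` sets. This is a hedged sketch; the function name `active_conda_environment` is made up, and the result is `None` whenever no conda environment is active:

```python
import os
import sys


def active_conda_environment():
    """Return the name of the active conda environment, or None."""
    # `conda activate` exports the environment name in this variable
    return os.environ.get("CONDA_DEFAULT_ENV")


print("Active conda environment:", active_conda_environment())
print("Interpreter in use:", sys.executable)
```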

#### Deactivating a Conda Environment

To deactivate the currently active conda environment, use the `conda deactivate` command.

```bash
conda deactivate
```

This command will return you to your base environment.

#### Listing Conda Environments

To see a list of all conda environments on your system, use the `conda env list` command.

```bash
conda env list
```

#### Removing a Conda Environment

To remove a conda environment, use the `conda remove` command followed by the `--name` flag and the name of the environment, along with the `--all` flag to remove all packages in the environment.

```bash
conda remove --name css_mzes2024 --all
```

#### Exporting and Importing Environments

##### Exporting an Environment

To export the environment's configuration to a YAML file, use the `conda env export` command:

```bash
conda env export --name css_mzes2024 > css_mzes2024.yml
```

##### Importing an Environment

To create a new environment from an exported YAML file, use the `conda env create` command:

```bash
conda env create --file css_mzes2024.yml
```

You now have the knowledge to create, activate, deactivate, list, remove, export, and import conda environments. These commands will help you manage your project dependencies efficiently and keep your development work organized.