# MLOps Project: Environment Setup

## Introduction
This notebook sets up the development environment and installs the tools needed for the project: binary classification of income (over or under 50,000) using the "Adult income dataset".

## Learning objectives
After completing this notebook, you will:
- Have a working Python environment with all required packages
- Understand the basic project structure
- Have initialized a Git repository
- Have verified the functionality of all MLOps tools

## 1. Environment setup

### 1.1 Create a virtual environment
Run these commands in the terminal:


```bash
python -m venv mlops-venv
# On Windows
.\mlops-venv\Scripts\activate
# On Unix/macOS
source mlops-venv/bin/activate
```
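
Whether the notebook kernel actually runs inside the activated environment can be checked from Python. The following is a minimal sketch; the helper name `in_virtualenv` is hypothetical, and the check only recognizes environments created with `venv` (a conda environment, for example, is not detected this way).

```python
import sys

def in_virtualenv():
    """Return True when the interpreter runs inside a venv-style virtual environment."""
    # Inside a venv, sys.prefix points at the environment directory,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix

print(f"Interpreter: {sys.executable}")
print(f"Inside a virtual environment: {in_virtualenv()}")
```

If the second line prints `False`, the kernel is probably using a different interpreter than the one in `mlops-venv`.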

### 1.2 Install dependencies
The following cell installs the required packages:

In [49]:
# Cell 1: Install the required packages
# %pip installs into the environment of the running kernel
%pip install numpy pandas scikit-learn mlflow pytest fastapi uvicorn great-expectations docker python-dotenv matplotlib seaborn




Defaulting to user installation because normal site-packages is not writeable
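
To make the installation repeatable, the same package list could be written to the `requirements.txt` file that appears in the project structure. This is only a sketch: the helper name `write_requirements` is hypothetical, and no versions are pinned because the install command above does not pin any.

```python
from pathlib import Path

# Package names copied from the pip command above.
PACKAGES = [
    "numpy", "pandas", "scikit-learn", "mlflow", "pytest", "fastapi",
    "uvicorn", "great-expectations", "docker", "python-dotenv",
    "matplotlib", "seaborn",
]

def write_requirements(path="requirements.txt", packages=PACKAGES):
    """Write one package name per line to `path` (hypothetical helper)."""
    target = Path(path)
    target.write_text("\n".join(packages) + "\n", encoding="utf-8")
    return target
```

Calling `write_requirements()` creates the file in the current directory; `pip install -r requirements.txt` then installs the same set of packages.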


### 1.3 Checking installation
To verify that all packages are installed correctly, run the following cell. It imports the required packages and prints the versions of the most important ones; a failed import raises an `ImportError`.

In [3]:
# Cell 2: Imports and version check
import sys
import numpy as np
import pandas as pd
import mlflow
import great_expectations as ge
from fastapi import FastAPI
import pytest
from sklearn.impute import SimpleImputer

# Print versions
print(f"Python Version: {sys.version}")
print(f"NumPy Version: {np.__version__}")
print(f"Pandas Version: {pd.__version__}")
print(f"MLflow Version: {mlflow.__version__}")
print(f"Great Expectations Version: {ge.__version__}")

Python Version: 3.12.7 | packaged by Anaconda, Inc. | (main, Oct  4 2024, 13:17:27) [MSC v.1929 64 bit (AMD64)]
NumPy Version: 1.26.4
Pandas Version: 2.1.4
MLflow Version: 2.20.2
Great Expectations Version: 1.3.6
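
The import check stops at the first missing package. As an alternative, the installed versions can be looked up without importing anything, so that all missing packages are reported at once. The sketch below uses the standard library module `importlib.metadata`; the helper name `installed_versions` and the selection of packages in `REQUIRED` are assumptions for illustration.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as passed to pip in section 1.2 (a subset).
REQUIRED = ["numpy", "pandas", "scikit-learn", "mlflow", "pytest",
            "fastapi", "uvicorn", "great-expectations"]

def installed_versions(packages):
    """Map each distribution name to its installed version, or None if it is missing."""
    result = {}
    for name in packages:
        try:
            result[name] = version(name)
        except PackageNotFoundError:
            result[name] = None
    return result

for name, ver in installed_versions(REQUIRED).items():
    print(f"{name}: {ver if ver is not None else 'NOT INSTALLED'}")
```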


## 2. Project structure
The project structure will be as follows:
```
adult_income/
├── data/
│   ├── raw/
│   └── processed/
├── notebooks/
│   ├── 00_Umgebung_Einrichtung.ipynb
│   ├── 01_Daten_Exploration.ipynb
│   ├── 02_Daten_Vorverarbeitung.ipynb
│   ├── 03_Modell_Engineering.ipynb
│   └── 04_Modell_Deployment.ipynb
├── src/
│   ├── data/
│   ├── features/
│   ├── models/
│   └── api/
├── tests/
├── .gitignore
├── README.md
└── requirements.txt
```

The following cell creates the directories relative to the current working directory:

In [4]:
# Cell 3: Create the project structure
import os

def create_project_structure():
    # Define the directory structure
    directories = [
        'data/raw',
        'data/processed',
        'notebooks',
        'src/data',
        'src/features',
        'src/models',
        'src/api',
        'tests'
    ]

    # Create the directories (existing ones are left untouched)
    for dir_path in directories:
        os.makedirs(dir_path, exist_ok=True)
        print(f"Directory created: {dir_path}")

create_project_structure()

Directory created: data/raw
Directory created: data/processed
Directory created: notebooks
Directory created: src/data
Directory created: src/features
Directory created: src/models
Directory created: src/api
Directory created: tests


## 3. Git setup

### 3.1 Initialize the Git repository
Run this command in the terminal:


```bash
git init
```
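
Whether the initialization worked can also be verified from the notebook. This is a minimal sketch that calls `git rev-parse --is-inside-work-tree` through `subprocess`; the helper name `is_git_repo` is hypothetical, and the function simply returns `False` when the `git` executable is not found.

```python
import subprocess

def is_git_repo(path="."):
    """Return True if `path` lies inside a Git working tree."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "--is-inside-work-tree"],
            cwd=path, capture_output=True, text=True, check=False,
        )
    except FileNotFoundError:
        # The git executable (or the given directory) does not exist.
        return False
    return result.returncode == 0 and result.stdout.strip() == "true"

print(f"Git repository initialized: {is_git_repo()}")
```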

### 3.2 Create .gitignore
The following cell creates the `.gitignore` file:

In [7]:
# Cell 4: Create .gitignore
gitignore_content = """
# Python
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
env/
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
*.egg-info/
.installed.cfg
*.egg

# Virtual environment
mlops-venv/
venv/
ENV/

# Jupyter Notebook
.ipynb_checkpoints

# MLflow
mlruns/

# Data
data/raw/*
data/processed/*
!data/raw/.gitkeep
!data/processed/.gitkeep

# IDE
.idea/
.vscode/
"""

with open('.gitignore', 'w') as f:
    f.write(gitignore_content)
print(".gitignore file was created")

.gitignore file was created
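
The rules `!data/raw/.gitkeep` and `!data/processed/.gitkeep` only have an effect if those files exist. Git does not track empty directories, so a placeholder file is needed to keep the data folders in the repository. A minimal sketch of creating them; the helper name `add_gitkeep` is hypothetical.

```python
from pathlib import Path

def add_gitkeep(directories=("data/raw", "data/processed"), root="."):
    """Create an empty .gitkeep file in each directory so Git can track the folder."""
    created = []
    for directory in directories:
        folder = Path(root) / directory
        folder.mkdir(parents=True, exist_ok=True)
        keep_file = folder / ".gitkeep"
        keep_file.touch(exist_ok=True)  # no-op if the file already exists
        created.append(keep_file)
    return created
```

Calling `add_gitkeep()` from the project root creates both placeholder files.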


## 4. Download dataset
The Adult Income dataset is published on Kaggle: https://www.kaggle.com/datasets/wenruliu/adult-income-dataset/data

The cell below loads a copy of the CSV file from a GitHub repository and saves it to `data/raw/`.

In [5]:
# Cell 5: Download the dataset
import pandas as pd

url = "https://raw.githubusercontent.com/vladislabv/fhswf-mlops-project/refs/heads/1_Datenverarbeitung_lip/adult.csv"
df = pd.read_csv(url)
df.to_csv('data/raw/adult-income.csv', index=False)
print("The dataset was downloaded and saved in data/raw/adult-income.csv")


The dataset was downloaded and saved in data/raw/adult-income.csv
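
Re-running the cell downloads the file again each time. If that is not wanted, the download could be skipped whenever a local copy already exists. The sketch below uses only the standard library; the helper name `fetch_if_missing` is hypothetical, and unlike the cell above it stores the file byte for byte instead of re-serializing it through pandas.

```python
from pathlib import Path
from urllib.request import urlretrieve

def fetch_if_missing(url, destination):
    """Download `url` to `destination` unless the file already exists."""
    target = Path(destination)
    if target.exists():
        return target  # reuse the local copy, no network access
    target.parent.mkdir(parents=True, exist_ok=True)
    urlretrieve(url, str(target))
    return target
```

With the variables from the cell above, `pd.read_csv(fetch_if_missing(url, 'data/raw/adult-income.csv'))` would load the data and touch the network only on the first run.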


The dataset contains the following columns:

age, workclass, fnlwgt, education, educational-num, marital-status, occupation, relationship, race, gender, capital-gain, capital-loss, hours-per-week, native-country, income

In [6]:
# First, get an overview of the dataset

df.info()  # prints its summary directly and returns None
print(df.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   age              48842 non-null  int64 
 1   workclass        48842 non-null  object
 2   fnlwgt           48842 non-null  int64 
 3   education        48842 non-null  object
 4   educational-num  48842 non-null  int64 
 5   marital-status   48842 non-null  object
 6   occupation       48842 non-null  object
 7   relationship     48842 non-null  object
 8   race             48842 non-null  object
 9   gender           48842 non-null  object
 10  capital-gain     48842 non-null  int64 
 11  capital-loss     48842 non-null  int64 
 12  hours-per-week   48842 non-null  int64 
 13  native-country   48842 non-null  object
 14  income           48842 non-null  object
dtypes: int64(6), object(9)
memory usage: 5.6+ MB
   age  workclass  fnlwgt     education  educational-num      marital-s

Although `df.info()` reports no null values, the dataset does contain missing values: they are encoded as the string '?' rather than as NaN.
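
A minimal sketch of how the '?' placeholders could be counted and converted to real missing values. The three sample rows are made up for illustration and are not taken from the dataset; the sketch also assumes the placeholder is exactly `?` with no surrounding whitespace.

```python
import numpy as np
import pandas as pd

# Tiny made-up sample that mimics the '?' placeholder (not real rows).
sample = pd.DataFrame({
    "age": [25, 38, 28],
    "workclass": ["Private", "?", "Local-gov"],
    "occupation": ["?", "?", "Protective-serv"],
})

# Count the '?' placeholders per column.
placeholder_counts = (sample == "?").sum()

# Convert the placeholders to NaN so that pandas treats them as missing.
cleaned = sample.replace("?", np.nan)
missing_counts = cleaned.isna().sum()

print(placeholder_counts)
print(missing_counts)
```

The same conversion can be done while loading by passing `na_values="?"` to `pd.read_csv`.
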
The columns of the dataset are:

- age: the age of the individual
- workclass: the employment status of the individual
- fnlwgt: final weight, the number of people the census believes the entry represents
- education: the highest level of education achieved
- educational-num: the highest level of education achieved, in numerical form
- marital-status: the marital status of the individual
- occupation: the general type of occupation
- relationship: the individual's relationship to others
- race: a description of the individual's race
- gender: the sex of the individual
- capital-gain: capital gains of the individual
- capital-loss: capital losses of the individual
- hours-per-week: the hours the individual reported working per week
- native-country: the individual's country of origin
- income: the target variable, whether the income is over or under 50,000
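
For the later notebooks it may help to keep the column names in one place, grouped by the dtypes that `df.info()` reported. This is a sketch only: the constant names and the helper `check_columns` are hypothetical, and treating `income` as the target follows from the project goal stated in the introduction.

```python
# Column names and dtypes as reported by df.info() above.
NUMERIC_COLUMNS = [
    "age", "fnlwgt", "educational-num",
    "capital-gain", "capital-loss", "hours-per-week",
]
CATEGORICAL_COLUMNS = [
    "workclass", "education", "marital-status", "occupation",
    "relationship", "race", "gender", "native-country",
]
TARGET_COLUMN = "income"

def check_columns(columns):
    """Return the expected column names that are missing from `columns`."""
    expected = NUMERIC_COLUMNS + CATEGORICAL_COLUMNS + [TARGET_COLUMN]
    return [name for name in expected if name not in columns]
```

For the DataFrame loaded above, `check_columns(df.columns)` should return an empty list.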



The next notebook, `01_Daten_Exploration.ipynb`, starts with the data exploration.