# Python and Data Science

Python is an open-source, interpreted, high-level language with strong support for object-oriented programming. It is one of the languages most widely used by data scientists across data science projects and applications. Python offers rich functionality for mathematics, statistics, and scientific computing, and it provides mature libraries for data science applications.

One of the main reasons Python is widely used in the scientific and research communities is its ease of use and simple syntax, which make it easy to pick up for people who do not have an engineering background. It is also well suited to quick prototyping.

![](https://www.brsoftech.com/blog/wp-content/uploads/2019/11/most-in-demand-programming-languages-2020.png)

# Is Python a New Language?

Python was first released in 1991. It was created by Guido van Rossum as a hobby project. 

It was named after the comedy TV series Monty Python's Flying Circus.

![Monty Python](https://upload.wikimedia.org/wikipedia/en/c/cd/Monty_Python%27s_Flying_Circus_Title_Card.png)

# Computing for Everybody
As Python was becoming popular, Van Rossum submitted a funding proposal to DARPA called "Computer Programming for Everybody", in which he further defined his goals for Python:
- An easy and intuitive language just as powerful as major competitors
- Open source, so anyone can contribute to its development
- Code that is as understandable as plain English
- Suitability for everyday tasks, allowing for short development times

> In 2018, Python was the third most popular language on GitHub, a social coding website, behind JavaScript and Java.

According to a programming language popularity survey it is consistently among the top 10 most mentioned languages in job postings. Furthermore, Python has been among the 10 most popular programming languages every year since 2004 according to the TIOBE Programming Community Index.

# The Zen of Python

The Zen of Python is a collection of 19 "guiding principles" for writing computer programs that influence the design of the Python programming language. Software engineer Tim Peters wrote this set of principles and posted it on the Python mailing list in 1999. Peters's list left open a 20th principle "for Guido to fill in", referring to Guido van Rossum, the original author of the Python language. The vacancy for a 20th principle has not been filled.

- Beautiful is better than ugly.
- Explicit is better than implicit.
- Simple is better than complex.
- Complex is better than complicated.
- Flat is better than nested.
- Sparse is better than dense.
- Readability counts.
- Special cases aren't special enough to break the rules.
- Although practicality beats purity.
- Errors should never pass silently.
- Unless explicitly silenced.
- In the face of ambiguity, refuse the temptation to guess.
- There should be one—and preferably only one—obvious way to do it.
- Although that way may not be obvious at first unless you're Dutch.
- Now is better than never.
- Although never is often better than right now.
- If the implementation is hard to explain, it's a bad idea.
- If the implementation is easy to explain, it may be a good idea.
- Namespaces are one honking great idea—let's do more of those!

### Getting Started

The following line prints a message. To execute it, select the cell below and do one of the following:
1. Click the RUN button on the top
1. On the menubar on the top, click CELL > RUN CELLS
1. Press Ctrl + ENTER on the keyboard

In [None]:
print ("This is Python!")

Try to write a line yourself below. You can add an empty block of code using the following:
1. Click INSERT > INSERT CELL BELOW
1. Press ESC key on the keyboard to enter the command mode. The color on the left will change. Then press 'b' to add a cell 'below'

Write a statement that asks the user for their name and prints Hello NAME.

Hint: the `input()` function prints a message and prompts the user to enter a value, which can be stored in a variable.

In [None]:
username = input("What is your name?")
print ("Hello "+username)

### Output in Jupyter Notebook

In [None]:
"This is Python!"

Ideally you would use `print()` statements to print the value of an object, or `display()` to print a `DataFrame`.

In [None]:
myval = 5

myval

What does the next piece of code do?

In [None]:
a = 10
b = 15

if b > a:
    print("B is greater")

The following block creates a `list` of items and runs a loop to print them.

In [None]:
fruits = ["Apple", "Banana", "Mango"]

for fruit in fruits:
    print ("I eat "+fruit) 

## 1.1. Discussion: Data Types

- What different forms of 'data' items do you expect to work with?
- What more elaborate yet basic arrangements of data do you know about?
- What operations can you perform on data items?

## 1.2. Arithmetic Operations

In [None]:
a = 5
b = 3

In [None]:
c = a + b
print ("a + b = ", c)

d = a - b
print ("a - b = ", d)

e = a * b
print ("a * b = ", e)

f = a/b
print ("a / b = ", f)

g = a**b # Power
print ("a ** b = ", g)

h = a//b # Integer division
print ("a // b = ", h)

In Jupyter, objects persist across cells for as long as the kernel is running.

In [None]:
h

In [None]:
a = 1

If you run the previous block with arithmetic operators again, you will see that the results have changed because the value of `a` has been updated.

## SIDENOTE: What is Anaconda?

Anaconda is a free and open-source distribution of the Python and R programming languages for scientific computing (data science, machine learning applications, large-scale data processing, predictive analytics, etc.), that aims to simplify package management and deployment. 

There are several alternatives; however, Anaconda is among the most popular because it simplifies managing Python components.

Jupyter Notebook (formerly IPython Notebooks) is a web-based interactive computational environment for creating Jupyter notebook documents. The "notebook" term can colloquially make reference to many different entities, mainly the Jupyter web application, Jupyter Python web server, or Jupyter document format.

# Markdown

You can edit this block to see how it is formatted. Double-click anywhere on this text to activate.

This is text. This is not python code. This will not run. 

But this will be displayed properly.

# This is a heading

## This is a smaller heading

### This is an even smaller heading



Here are the reasons why you should use markdown cells:
1. It makes your notes look better
1. It helps other programmers understand what you are doing

## 1.3. Comments in Python

Comments are used to explain the code, make notes to help other programmers, or make notes for future scope.
They are mostly used to make code readable.

They are part of the code, not the markdown.

In [None]:
print ("Hello World!") # Prints a string
print (5+9) # Prints the result of 5 + 9
print ('The end!') # Bye

In [None]:
def add_nums_and_print(num1, num2):
    """
    Add two numbers and print the sum.

    Parameters:
    num1: first number
    num2: second number

    Returns: None (the sum is printed, not returned)
    """
    ans = num1+num2
    print ("The answer is "+str(ans))
    
add_nums_and_print(5,10)

In Python, you can use the help() function to see the details of the function.

In [None]:
help(print)

# 2. Problem Solving using Programming

Python is an interpreted, object-oriented, high-level programming language with dynamic semantics. Its high-level built-in data structures, combined with dynamic typing and dynamic binding, make it very attractive for Rapid Application Development, as well as for use as a scripting or glue language to connect existing components together.

### data structure
An organization of data for the purpose of making it easier to use.
### immutable data value
A data value which cannot be modified. Assignments to elements or slices (sub-parts) of immutable values cause a runtime error.
### mutable data value
A data value which can be modified. The types of all mutable values are compound types. Lists and dictionaries are mutable; strings and tuples are not.
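
The difference between mutable and immutable values can be seen in a minimal sketch. The variable names below are illustrative only; the point is that replacing an element works for a list but raises a `TypeError` for a string.

```python
# Lists are mutable: an element can be replaced in place.
fruits = ["Apple", "Banana", "Mango"]
fruits[0] = "Cherry"
print(fruits)

# Strings are immutable: item assignment raises a TypeError.
greeting = "hello world!"
try:
    greeting[0] = "H"
    string_was_modified = True
except TypeError as error:
    string_was_modified = False
    print("Cannot modify a string in place:", error)

# A new string can still be built from pieces of the old one.
capitalized = "H" + greeting[1:]
print(capitalized)
```

The original `greeting` is left untouched; `capitalized` is a separate, new string.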


## 2.1. Data Structures in Python:
**1. Strings**
Collection of Unicode characters. Indexed and immutable, e.g. `"hello world!"`

**2. Lists**
Collection of elements. Indexed and mutable; allows duplicates, e.g. `[10, 20, 30]`

**3. Tuple**
Collection that is indexed but immutable, e.g. `("apple", 250)`

**4. Set**
Collection of unordered elements that doesn't allow repetitions, e.g. `{"apple", "orange"}`

**5. Dictionary**
Collection of key-value pairs, e.g. `{key: value}`
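
As a rough illustration, the five structures can be created and accessed as follows. The values (such as the price of 120) are made up for the example.

```python
text = "hello world!"                    # string: indexed, immutable
numbers = [10, 20, 30, 10]               # list: indexed, mutable, duplicates allowed
item = ("apple", 250)                    # tuple: indexed, immutable
basket = {"apple", "orange", "apple"}    # set: unordered, duplicates are dropped
prices = {"apple": 250, "orange": 120}   # dictionary: key-value pairs

print(text[0])          # first character of the string
print(numbers[1])       # second element of the list
print(item[0])          # first element of the tuple
print(len(basket))      # number of distinct elements in the set
print(prices["apple"])  # value stored under the key "apple"
```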

##  2.2. Control Flow

A program’s control flow is the order in which the program’s code executes. The control flow of a Python program is regulated by conditional statements, loops, and function calls. This section covers the if statement and for and while loops; functions are covered in the next class.
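
The `for` loop already appeared in the fruit example earlier. A `while` loop, by contrast, repeats for as long as a condition holds. The sketch below uses a made-up scenario (counting how many 5 km trips fit into 22 km) purely for illustration.

```python
remaining_km = 22   # hypothetical distance budget
trips = 0

# Keep going while another 5 km trip still fits into the budget.
while remaining_km >= 5:
    remaining_km -= 5
    trips += 1

print("Trips:", trips, "Remaining km:", remaining_km)
```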

![](https://www.researchgate.net/profile/Kay_Smarsly/publication/322509045/figure/fig1/AS:583153716215809@1516046088625/Control-flow-of-elementary-control-structures.png)

### The if Statement

Often, you need to execute some statements only if some condition holds, or choose statements to execute depending on several mutually exclusive conditions. The Python compound statement if, which uses if, elif, and else clauses, lets you conditionally execute blocks of statements. 

### Comparison Operators
- x == y
- x != y
- x < y
- x <= y
- x > y
- x >= y
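
A small sketch of how these operators combine with `if`, `elif`, and `else`. The function name `compare` is hypothetical and only serves the example.

```python
def compare(x, y):
    """Return a short description of how x relates to y."""
    if x == y:
        return "x equals y"
    elif x < y:
        return "x is less than y"
    else:
        # Neither equal nor smaller, so x must be greater.
        return "x is greater than y"

print(compare(10, 15))
print(compare(15, 10))
print(compare(7, 7))
```

Only one branch runs per call: Python checks the conditions from top to bottom and stops at the first one that is true.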

## 2.3. Python Indentation
In Python, code blocks are defined by a consistent number of leading spaces. This is called Python indentation.

A block ends at the first line that is indented less than the block itself.

The best practice is to use four spaces per indentation level, which is what Jupyter inserts when you press Tab.
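
The following sketch, using made-up temperature values, shows how the indentation level decides which block a line belongs to.

```python
temperatures = [12, 25, 31]  # hypothetical values
warm_days = 0

for t in temperatures:
    if t > 20:
        # Indented twice: runs only when the condition holds.
        warm_days += 1
    # Indented once: runs on every pass of the loop.
    print("Checked", t)

# Not indented: runs once, after the loop has finished.
print("Warm days:", warm_days)
```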


## 2.4. Developing Logic Flow in Python

When you write a program, Python is just the last step. Before we start typing Python code, we need to understand how to think through a problem. Programming is not about memorizing commands — it’s about breaking a real-world task into simple steps that a computer can follow. Here’s a simple way to approach a computational problem:

##### 1. Problem Definition
Problem definition is the essential first phase of any programming task (or even a subtask). Your goal here is not to think about Python, but rather to clarify the primary requirements and ensure that you fully understand the task. You can start by asking yourself:

- What is the actual goal?
- What should happen when the program runs from beginning to end?
- What decision or calculation needs to happen?
- Is the problem asking for a single result or several steps?

Try restating the problem in your own words. If you can't explain it, you're not ready to code yet.

##### 2. Identify the Inputs
A program can only work with the information it is given — it cannot assume, guess, or “fill in the blanks.”
Identifying inputs means figuring out exactly what data the program must receive before any processing can happen. To clarify the inputs, ask yourself:

- What information does the program need from the user?
- Are there any pieces of information that come from elsewhere?
- What values must be given before we can calculate anything?
- What kind of inputs might the user enter? What are their data types?
- Are some inputs optional?

##### 3. Identify the Output
Think about the final result and how that result will be used. In the following problems, we simply want to print the result on the screen. In more complex cases, you will have to consider whether an end user, another program, or a database will capture and process the output.

- What is the final result the program must produce?
- Is the output a single value, multiple values, or a formatted message?
- Should the output be a number, text, a list, or something else?

##### 4. Develop the Logic
Once you understand the problem clearly, know its boundaries, and have identified the inputs and outputs, you can come up with structured rules that a computer can follow step by step.

- Identify the rules involved
- Consider different logic flows depending on conditions
- Explore any limits, thresholds, exceptions, and borderline conditions
- Break complex logic into clear chunks in a flowchart or in plain language

#### 2.4.1. Preparing for the First Challenge

Your organization offers reimbursement for work-related travel. Say the reimbursement rate is €0.21 per km. Modify the program below to compute the correct reimbursement a colleague should be offered, given the total distance they travelled.

In [None]:
km = float(input("Enter the number of kilometers travelled: "))
rate = 0.15
reimbursement = km * rate

print("Total reimbursement: €", reimbursement)

The only change needed is the value of `rate`, which should match the stated policy of €0.21 per km.

#### 2.4.2. Developing the logic
A senior manager in HR has come up with a new reimbursement policy that accounts for the fact that different modes of transport carry different costs. You are requested to write a program that asks the user for a mode of transport and the distance travelled, and then reports the total amount that should be reimbursed. Here are the updated rules in the department:
- "car": €0.23 per km
- "bike": €0.06 per km
- "public": €0.19 per km

Complete the program below to develop this logic.

In [None]:
mode = input("Enter Mode of transport : ")
mode = mode.lower()
distance = input("Enter Distance Travelled : ")
distance = float(distance)

reimbursement = 0.0  # default so the cell runs even if no branch matches
if mode == "car":
    reimbursement = distance * 0.23
elif mode == "":
    # complete this part
    pass

print("Reimbursement Offered: €", round(reimbursement, 2))

The following block shows an alternative way to express the logic with the help of data structures. This version still has some flaws, but it makes the program more compact, easier to debug, and easier to update when policies change.

Adding a new item to a dictionary:

If you assign to a key that already exists, the previous value is overwritten.

Operations on Python dictionaries:
- `policy['flight'] = 0.33` - Adds a new item to the dictionary
- `policy['car'] = 0.25` - Overwrites the value, because the key is already present
- `policy.keys()` - Returns the names of all the available transport modes
- `policy.values()` - Returns the rates assigned to the modes
- `policy.items()` - Returns all the items as (key, value) pairs

Check the example below and make changes in the dictionary to explore more.

In [None]:
policy = {"car":0.23, "bicycle": 0.06, "public": 0.19, "unspecified": 0.10}

mode = "car"
distance = 8

reimbursement = distance * policy[mode]
print ("Reimbursed: ", reimbursement)

policy['flight'] = 0.33
print (policy)
policy['car'] = 0.25
print (policy)

You can add a new item in the dictionary.
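
One flaw of a plain lookup such as `policy[mode]` is that it raises a `KeyError` when the mode is not a key in the dictionary. A possible remedy, sketched here, is `dict.get()` with a fallback value. The helper name `rate_for` is hypothetical, and falling back to the "unspecified" rate is an assumption about what the policy intends.

```python
policy = {"car": 0.23, "bicycle": 0.06, "public": 0.19, "unspecified": 0.10}

def rate_for(mode):
    """Return the per-km rate for a mode, falling back to 'unspecified'."""
    # dict.get returns the second argument when the key is missing.
    return policy.get(mode, policy["unspecified"])

print(rate_for("car"))
print(rate_for("scooter"))  # not in the policy, so the fallback rate is used
```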

#### 2.4.3. Group Task - Develop Final Policy
After the initial rollout of the department's travel reimbursement policy, the department found that more employees would consider using public transport if local travel were encouraged. Here is the final reimbursement policy:
- Car: €0.23 per km
- Bicycle: €0.06 per km
- Public transport:
  - If total distance is less than 10 km, flat €2
  - If the distance is more than 10 km, €0.19 per km, up to a maximum total of €60
 
Problem Understanding (5 minutes):
- Individually, read about the problem solving process in section 2.4 above.
- Write how you would define the problem in a simple sentence. You do not have to explain the detailed policy rules.
- Identify inputs and outputs, their data types, and any concerns you might have about them

Discussion and development of the logic flow (5 minutes):
- Pair with a colleague and provide feedback on your partner's problem definition, inputs and outputs
- Develop the logic flow in the form of a flow chart, block diagram, or plain language
- Modify the code in the cells below and develop your logic.

**Hint:**
Functions, defined using the `def` keyword, are an efficient way to modularize your code and develop your logic in modules, or blocks with their own independent responsibility. Each function can take inputs (optional), implement the logic, and return one final answer.

We have already seen some functions above, e.g. `float(distance)` converts raw keyboard input to a decimal number. `print("hello")` is a function that takes one or more objects as input, prints them on the screen, and returns `None`.
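
As a minimal sketch of these ideas, the function below takes two inputs and returns one answer. The name `reimbursement_for` and its parameters are hypothetical and are not part of the task skeleton. The last two lines illustrate that `print()` itself returns `None`.

```python
def reimbursement_for(distance_km, rate_per_km):
    """Return the reimbursement in euros for a single trip."""
    return distance_km * rate_per_km

# The returned value can be stored and reused.
amount = reimbursement_for(12, 0.23)
print("Reimbursement: €", round(amount, 2))

# print() displays its arguments but gives back None.
result = print("hello")
print(result)
```

Returning a value, rather than printing it inside the function, keeps the calculation reusable: the caller decides whether to print, store, or further process the result.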



In [None]:
def compute_reimbursement(mode, distance):
    """
    Compute reimbursement based on mode of transport and distance.

    Parameters
    ----------
    mode : str
        Normalized transport mode. Expected values include:
        - "car"
        - "bicycle"
        - "public"
        - anything else (default/fallback case)

    distance : float
        Distance travelled in kilometers. Must be >= 0.

    Returns
    -------
    float
        The reimbursement amount in euros.
    """
    # TODO: implement the policy rules; the 0.0 below is only a placeholder
    reimbursement = 0.0
    return reimbursement


def get_transport_mode():
    """ 
    Ask user for a mode of transport. Perform required validation, cleaning, case folding, and return the final mode.
    If you want more challenge, consider asking yourselves how would you check for one of the allowed modes, or ask again if the input is invalid
    """
    mode = input("Enter mode of transport (car, bicycle, public): ")

    # add cleaning + validation logic here.
    mode = mode.strip().lower()

    # TODO: validate mode; if invalid, ask again or return a default

    return mode   
    
def get_distance_travelled():
    """ Ask user for the distance travelled """
    # Hint: You'll likely want a loop:
    # while True:
    #     value = input("...")
    #     try:
    #         distance = float(value)
    #         if distance >= 0:
    #             return distance
    #         else:
    #             print("Distance cannot be negative.")
    #     except ValueError:
    #         print("Please enter a valid number.")

    return


mode = get_transport_mode()
distance = get_distance_travelled()
reimbursement = compute_reimbursement(mode, distance)

print("Reimbursement Offered: €", round(reimbursement, 2))

## Bonus: QR Code in Python

In [None]:
%pip install qrcode

import qrcode

img = qrcode.make('https://www.uu.nl/')

img.save("tempqr.png")

In [None]:
policy = {"car":0.23, "bicycle": 0.06, "public": 0.19, "unspecified": 0.10}


In [None]:
import random

# Simulate ten reimbursements with random distances (0-35 km) and random modes
for i in range(10):
    km = round(random.random() * 35)
    rate = policy[random.choice(list(policy.keys()))]
    print(round(rate * km, 2))

# 3. Towards Data Analysis

### 3.1. Summarizing Datasets 
Imagine we collected all the travel reimbursements for employees at an office location on a day. This data can be simply represented in a list with each entry corresponding to an item in the list. 

To describe what the dataset contains, you can report:
- Number of items in the data (number of reimbursements)
- What else?

In [None]:
utrecht_reimbursements = [0.78, 2.94, 1.48, 3.49, 3.15, 1.89, 1.34, 5.03, 2.78, 1.13, 2.35, 3.51, 1.68, 3.37, 2.6]

In [None]:
total = sum(utrecht_reimbursements)
count = len(utrecht_reimbursements)
mean_utrecht = total / count
mean_utrecht

#### 3.1.2. Using Numerical Analysis Tools

NumPy, short for Numerical Python, addresses the limitations of traditional Python lists when it comes to numerical computations. Python lists are flexible but lack the optimized structure needed for handling large datasets and performing complex mathematical operations. NumPy bridges this gap by introducing a powerful array object that facilitates vectorized operations and enhances computational efficiency.
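
To make "vectorized operations" concrete, here is a small sketch that multiplies a few made-up distances by a rate, once with a plain list and once with a NumPy array.

```python
import numpy as np

distances_km = [4.0, 12.5, 8.0]   # hypothetical distances
rate = 0.23

# With a plain list, each element is handled explicitly.
list_result = [d * rate for d in distances_km]

# With a NumPy array, the multiplication applies to every element at once.
array_result = np.array(distances_km) * rate

print(list_result)
print(array_result)
```

Both approaches give the same numbers; the array version is shorter and, for large datasets, typically much faster.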

In [None]:
import numpy as np
utrecht_reimbursements = np.array(utrecht_reimbursements)
utrecht_reimbursements

In [None]:
# Prints the max value in the array
print(np.max(utrecht_reimbursements))

# Prints the mean of the array
print(np.mean(utrecht_reimbursements))

# Prints the mean of the array, using the array method instead
print(utrecht_reimbursements.mean())

#### 3.1.2a. 2-D arrays

A 1-dimensional array in NumPy is simply a sequence of values arranged in a single line. You can think of it like a list of numbers: reimbursements for a day, temperatures measured hourly, or the ages of employees in a department. These arrays are the simplest form of numerical data storage and are extremely common in everyday data analysis. Because they store values in a contiguous block of memory and all elements share the same data type, operations like summing, averaging, or applying mathematical functions are very fast.

A 2-dimensional array extends this idea into a table-like structure with rows and columns. This is the format most people are familiar with from spreadsheets or CSV files. Each row can represent an observation (e.g., one employee’s daily travel data), while each column represents a variable (e.g., distance, reimbursement amount, transport mode encoded as a number). NumPy's 2D arrays allow you to perform operations across rows or columns efficiently, making them essential for tasks like statistical summaries, matrix multiplication, filtering, and preparing data for machine learning models. This structure also lays the groundwork for linear algebra, image representation, and large-scale tabular datasets.

A 3-dimensional array adds another axis, allowing you to stack multiple 2D arrays on top of each other. This format is crucial when working with datasets that contain sequences or layers—such as multiple days of tabular data, frames in a video, or RGB image data where the three color channels form a depth dimension.
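
As an illustration, two small 2-D arrays (standing in for two days of tabular data, with made-up values) can be stacked into one 3-D array with `np.stack`.

```python
import numpy as np

day_1 = np.array([[1, 2, 3], [4, 5, 6]])
day_2 = np.array([[7, 8, 9], [10, 11, 12]])

# Stack the two tables along a new first axis.
stacked = np.stack([day_1, day_2])

print(stacked.ndim)       # number of dimensions
print(stacked.shape)      # (layers, rows, columns)
print(stacked[1, 0, 2])   # second layer, first row, third column
```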

![https://i1.wp.com/indianaiproduction.com/wp-content/uploads/2019/06/NumPy-array.png?resize=768%2C368&ssl=1](https://i1.wp.com/indianaiproduction.com/wp-content/uploads/2019/06/NumPy-array.png?resize=768%2C368&ssl=1)

The following cells show basic operations on 2-D arrays using NumPy.

In [None]:
ll = [[1,2,3],[4,5.0,6]]

arr = np.array(ll)
print(arr.shape)   # (rows, columns)
print(type(arr))
arr

In [None]:
print(arr[0])     # Prints the first row
print(arr[1])     # Prints the second row
print(arr[:, 1])  # Prints the second column
print(arr[1, 1])  # Prints the element in second row, second column

In [None]:
column_means = np.mean(arr, axis=0)  # one mean per column
row_means = np.mean(arr, axis=1)     # one mean per row
column_means, row_means

### 3.2. Group Task: Analyze Reimbursements from two cities

Now you are given reimbursements from both Amsterdam and Utrecht. Work in pairs and discuss the prompts below. 
- What kind of analytics "questions" could you answer based on this data?
- What analysis could you perform?

Perform the analysis and come up with your own interpretation of the results. Discuss your observations with the class.

In [None]:
utrecht_reimbursements = [0.78, 2.94, 1.48, 3.49, 3.15, 1.89, 1.34, 5.03, 2.78, 1.13, 2.35, 3.51, 1.68, 3.37, 2.6]
amsterdam_reimbursements = [0.39, 1.94, 0.97, 1.8, 6.36, 0.00, 0.91, 5.19, 3.67, 0.81, 3.79, 0.12, 1.85, 4.63, 5.49]

ut = np.array(utrecht_reimbursements)
am = np.array(amsterdam_reimbursements)

In [None]:


# TODO: compute mean for each city
# TODO: compute median for each city
# TODO: compute standard deviation for each city

In [None]:
mean_ut = np.mean(ut)
mean_am = np.mean(am)

median_ut = np.median(ut)
median_am = np.median(am)

std_ut = np.std(ut)
std_am = np.std(am)

mean_ut, median_ut, std_ut, mean_am, median_am, std_am
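
One detail worth knowing when interpreting these numbers: `np.std` computes the population standard deviation by default (`ddof=0`). If the values are treated as a sample from a larger population, the sample standard deviation is obtained with `ddof=1`. The sketch below shows both for the Utrecht data; which one is appropriate depends on how you view the dataset.

```python
import numpy as np

ut = np.array([0.78, 2.94, 1.48, 3.49, 3.15, 1.89, 1.34, 5.03,
               2.78, 1.13, 2.35, 3.51, 1.68, 3.37, 2.6])

population_std = np.std(ut)        # divides by n
sample_std = np.std(ut, ddof=1)    # divides by n - 1

print(round(population_std, 3), round(sample_std, 3))
```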

### 3.3. Data Visualization

Visualization gives shape to patterns—clusters, outliers, trends, sudden jumps—that are almost invisible in raw data. It’s a way of letting the data 'speak' without forcing people to dig through rows of numbers.

In our reimbursement example, you can compute the mean and median for Amsterdam and Utrecht, and that’s helpful—but with a boxplot or KDE plot, you can immediately see which city has more spread, where the typical values lie, and whether one city has more extreme cases. Visual summaries make it much easier to communicate results to colleagues, especially those who don’t love statistics as much as the rest of us.

Another advantage is that visualization helps you catch mistakes early. Sometimes a sensor fails, a stray zero sneaks in, or a student mistypes “600” instead of “6.00”. A quick plot can instantly reveal impossible values or weird patterns that indicate something is off.

Visualization strengthens decision-making. Whether you’re comparing office performance, monitoring travel expenses, or looking for seasonal patterns, a well-chosen chart can highlight the signal and hide the noise. It reduces cognitive load and makes complex datasets feel intuitive.

Look at the visualizations below and reflect on your observations in the previous section.

You can create a simple histogram showing how reimbursement values are distributed in the dataset.

In [None]:
import matplotlib.pyplot as plt
plt.hist(ut, bins=5)

In [None]:
import matplotlib.pyplot as plt

plt.hist(ut, bins=8, alpha=0.6, label="Utrecht")
plt.hist(am, bins=8, alpha=0.6, label="Amsterdam")
plt.xlabel("Reimbursement (€)")
plt.ylabel("Frequency")
plt.legend()
plt.title("Histogram of Reimbursements")
plt.show()


In [None]:
plt.boxplot([ut, am])
plt.xticks([1, 2], ["Utrecht", "Amsterdam"])  # label the two boxes
plt.ylabel("Reimbursement (€)")
plt.title("Distributions of reimbursements across the two cities")
plt.show()

Hints for 3.2: 
- Are reimbursements in one city generally higher than in the other? 
- How similar or different are the distributions?
- Are there outliers (unusually high/low values)?
- What reimbursement amounts are “typical” for each city?
- How do your statistical results change when you combine the reimbursements from both cities?

## 3.4. Reading Data from Files

CSV (Comma-Separated Values) files are one of the most common formats for sharing and storing tabular data. Whether you're working with reimbursement logs, employee records, survey responses, or financial data, there is a good chance it will arrive as a CSV. In Python, the easiest and most powerful way to work with CSV files is through the Pandas library. Pandas provides a high-level interface for loading, inspecting, and manipulating data — all in a way that feels very similar to working with Excel, but much faster and more flexible.

The first step is to import Pandas and use the function `pd.read_csv()`, which reads a CSV file directly into a DataFrame. A DataFrame is a Pandas data structure that behaves like a table: it has labeled columns, indexed rows, and supports filtering, calculations, and visualizations. Once loaded, you can inspect the first few rows, view column names, check data types, or compute summaries of your dataset.

In [None]:
import pandas as pd

df = pd.read_csv("reimbursements.csv")
df.head()


By default, `pd.read_csv()` assumes comma separators, treats the first row as column names, and infers column types, but you can specify these explicitly when needed. For example, if the file uses semicolons instead of commas, or if the first row doesn't contain column names, you can pass `sep` and `header`:

```
df = pd.read_csv("datafile.csv", sep=";", header=None)
```
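
To try these options without a file on disk, the sketch below reads semicolon-separated text from an in-memory buffer. The content and the column names passed to `names` are made up for the example.

```python
import io
import pandas as pd

# Hypothetical file content: semicolon-separated, no header row
raw = io.StringIO("E001;car;12.5\nE002;bicycle;4.0\n")

df_demo = pd.read_csv(
    raw,
    sep=";",
    header=None,  # the first row is data, not column names
    names=["employee_id", "transport_mode", "distance_km"],
)
print(df_demo)
```

Without `names`, Pandas would label the columns 0, 1, 2.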

You can also filter rows, select columns, compute new values, or visualize data directly from the DataFrame. Because CSV reading is so quick and consistent, Pandas becomes the foundation of most data workflows in Python. It’s the first step before cleaning, analyzing, plotting, or exporting results back into another file format.

Explore the dataset and answer the following questions:
- How many rows are displayed by `.head()`?
- Do the column names match what you expected?
- How many employees are there in the dataset?

Once the data is loaded into a DataFrame, it becomes very easy to explore:
```
df.info()        # column types, number of rows
df.describe()    # basic statistics
df.columns       # list of column names
```
- Which columns are numerical?
- Which columns are categorical (non-numeric)?
- How many rows are there in the dataset?
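
One possible way to answer these questions programmatically is `select_dtypes()`, sketched here on a tiny made-up table. The column `employee_id` and all values are assumptions for illustration; your real file may differ.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for the real file
raw = io.StringIO(
    "employee_id,location,transport_mode,distance_km,reimbursement_eur\n"
    "E001,Utrecht,car,12.5,2.88\n"
    "E002,Amsterdam,bicycle,4.0,0.24\n"
    "E003,Utrecht,public,20.0,3.80\n"
)
df_demo = pd.read_csv(raw)

# Split the column names by data type.
numeric_columns = df_demo.select_dtypes(include="number").columns.tolist()
other_columns = df_demo.select_dtypes(exclude="number").columns.tolist()

print(len(df_demo))       # number of rows
print(numeric_columns)
print(other_columns)
```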

# 4. Exploratory Data Analysis

### 4.1. Fetch a more detailed version of the dataset

### 4.2. Perform basic Exploratory analysis on the dataset
- What locations are present in the dataset?
- What transport modes are present in the dataset?
- How many unique employees are in the dataset?

**Hint:** Explore these functions.
```
df['column'].unique()
df['column'].nunique()
df['column'].value_counts()
```
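
A quick sketch of what each of these returns, using a small made-up DataFrame rather than the course dataset:

```python
import pandas as pd

# Hypothetical rows for illustration only
df_demo = pd.DataFrame({
    "location": ["Utrecht", "Amsterdam", "Utrecht", "Rotterdam"],
    "transport_mode": ["car", "bicycle", "public", "car"],
})

locations = df_demo["location"].unique()                  # array of distinct values
n_locations = df_demo["location"].nunique()               # how many distinct values
mode_counts = df_demo["transport_mode"].value_counts()    # rows per value

print(locations)
print(n_locations)
print(mode_counts)
```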

Now answer the following:
- What is the average travel distance?
- What is the most popular mode of transport?
- What is the mean reimbursement amount?
- What is the median reimbursement amount?
- Which metric (mean or median) seems higher? Why might that be?

```
df['column'].mean()
df['column'].median()
df['column'].mode()
```

In [None]:
df.groupby("location")["distance_km"].mean()

### 4.3. Aggregations

Now that we understand the basic structure of the dataset, we can begin answering more meaningful questions by grouping and aggregating the data.
Pandas makes this easy with the `groupby()` function, which lets us compute statistics for different categories such as locations or transport modes.

Let us approach a question: What is the average travel distance per location?
Different office locations may have different travel patterns. To find out how far employees travel on average in each city, we can group the data by the "location" column and compute the mean distance.

`df.groupby("location")["distance_km"].mean()`

This line of code:\
(i) Splits the dataset into four groups (Amsterdam, Utrecht, Rotterdam, Den Haag),\
(ii) Calculates the mean distance for each group, and\
(iii) Returns a summary listing each location and its corresponding average distance.
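
The same three steps can be followed on a tiny made-up table with only two locations, where the result is easy to verify by hand.

```python
import pandas as pd

# Hypothetical rows for illustration only
df_demo = pd.DataFrame({
    "location": ["Utrecht", "Amsterdam", "Utrecht", "Amsterdam"],
    "distance_km": [10.0, 4.0, 20.0, 8.0],
})

# Split by location, take the mean distance per group, combine into one Series.
mean_by_location = df_demo.groupby("location")["distance_km"].mean()
print(mean_by_location)
```

Utrecht averages (10 + 20) / 2 and Amsterdam averages (4 + 8) / 2, which is what the result shows.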


In [None]:
df.groupby("location")["distance_km"].mean()

#### 4.3. Case Example

> A colleague from accounting has mentioned that the organization needs to reduce the reimbursement rate for cars, as it is significantly raising the organization's expenses. The leadership is considering gently nudging employees towards public transport, as it might reduce costs. You are asked to verify whether car travel is indeed more costly to the company.

Each transport mode has a different reimbursement rate (car > public > bike). You can approach it via the following questions:
- Does the company spend the highest reimbursements on cars?
- Does car travel produce the highest average reimbursement?
- Is a particular kind of transport more stable compared to other options?
- Are bike reimbursements consistently low due to shorter distances and lower rates?

In [None]:
df.groupby("transport_mode")["reimbursement_eur"].mean()

#### 4.4. Case Example contd...

> Your previous analysis challenged the leadership’s assumption that car travel was the most expensive mode and should therefore be discouraged. While the report showed that cars can be costly, it also revealed that certain cities rely heavily on bikes, and public transport reimbursements have been increasing due to longer commutes.
> This has caused some unrest among leadership: they now fear that shifting employees away from cars might not produce the savings they expected, and may even create new budget issues in cities with long-distance employees. Leadership has asked you for a deeper analysis before they make any policy changes. They want to understand how travel patterns differ across locations, whether certain offices rely too heavily on expensive modes, and whether employee behavior—rather than reimbursement rate alone—is driving costs upward.


> The conversation has moved beyond “cars are expensive” to questions like are there certain cities that rely heavily on public transport because employees live far away? Or is it possible that there is a city where employees can easily commute with bikes but they prefer cars instead, and the organization can come up with policies to encourage them in particular? Now you are being asked to come up with observations that can help the leadership make more data informed decisions.

You can explore the analysis tasks mentioned in the following blocks of code.

##### Hint:

First consider which column or columns you want to group by. You can group by a single column or by several:
- `df.groupby("transport_mode")` - groups based on different transport modes
- `df.groupby(["location", "transport_mode"])` - groups based on all possible combinations of location and transport modes

Then think about which column you want to analyze and which aggregation (e.g. sum, mean) you want to apply.
- `df.groupby("transport_mode")["reimbursement_eur"].sum()` - provides sum of all the reimbursed amounts for each transport mode

Finally, depending on your approach to the problem, decide how to present the results, e.g. sort the values using the `sort_values()` method.

You may have to define new columns based on existing ones. For example, to show how reimbursement per kilometer differs across modes of transport, you can create a new column using `df['reimb_per_km'] = df["reimbursement_eur"] / df["distance_km"]`.
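The steps in this hint can be tried out on a tiny table first. The sketch below uses a hypothetical `toy` DataFrame with made-up values; only the column names are taken from the notebook's `df`, so the numbers say nothing about the real data.

```python
import pandas as pd

# Hypothetical toy data: column names follow the notebook, values are made up.
toy = pd.DataFrame({
    "location": ["A", "A", "B", "B", "B", "A"],
    "transport_mode": ["car", "bike", "car", "public", "bike", "car"],
    "distance_km": [10.0, 4.0, 20.0, 15.0, 5.0, 30.0],
    "reimbursement_eur": [5.0, 1.0, 10.0, 3.0, 1.0, 15.0],
})

# Steps 1 and 2: group by one column, pick the column to analyze, aggregate it.
total_by_mode = toy.groupby("transport_mode")["reimbursement_eur"].sum()

# Step 3: present the result with the largest total first.
total_by_mode = total_by_mode.sort_values(ascending=False)

# Derived column: reimbursement per kilometer for each individual trip.
toy["reimb_per_km"] = toy["reimbursement_eur"] / toy["distance_km"]
rate_by_mode = toy.groupby("transport_mode")["reimb_per_km"].mean()

# Grouping by two columns gives one row per (location, mode) combination
# that actually occurs in the data.
mean_by_city_mode = toy.groupby(["location", "transport_mode"])["reimbursement_eur"].mean()

print(total_by_mode)
print(rate_by_mode)
print(mean_by_city_mode)
```

Once the pattern is clear on the toy table, the same three steps can be applied to `df` by swapping in the columns that match the question being asked.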

-------------------

pandas provides further methods for exploration. Cross-tabulation is a common exploratory step that shows how rows are distributed across two categorical columns. For example, `pd.crosstab(df["location"], df["transport_mode"])` shows how many reimbursements were made in each city for each transport mode. You can also convert these counts to percentages within each city using: `(pd.crosstab(df["location"], df["transport_mode"], normalize="index")*100).round(2)`.
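To see what `normalize="index"` does, it may help to compare raw counts and row percentages side by side. This is a minimal sketch on a hypothetical `toy` table with made-up rows; the column names mirror the notebook's `df`.

```python
import pandas as pd

# Hypothetical toy data: six made-up trips in two cities.
toy = pd.DataFrame({
    "location": ["A", "A", "A", "A", "B", "B"],
    "transport_mode": ["car", "car", "bike", "public", "bike", "bike"],
})

# Raw counts: rows are cities, columns are transport modes.
counts = pd.crosstab(toy["location"], toy["transport_mode"])

# normalize="index" divides each row by its row total,
# so every city's percentages add up to 100.
shares = (pd.crosstab(toy["location"], toy["transport_mode"], normalize="index") * 100).round(2)

print(counts)
print(shares)
```

Row percentages make cities with very different numbers of trips comparable, which raw counts do not.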



Does the organization spend the highest reimbursements on cars?

Which mode has the highest cost per kilometer on average?

Which mode shows the largest variability in reimbursement (unstable cost patterns)?

Are bike reimbursements consistently low due to shorter distances and lower rates?

Which locations rely most heavily on expensive travel modes (car or public)?

In [None]:
# df.groupby("transport_mode")["reimbursement_eur"].sum()
# (df["reimbursement_eur"] / df["distance_km"]).groupby(df["transport_mode"]).mean().sort_values(ascending=False)
# pd.crosstab(df["location"], df["transport_mode"])
# df.groupby("transport_mode")["reimbursement_eur"].std().sort_values(ascending=False)
# df.groupby(["location", "transport_mode"])["distance_km"].mean()
# df.groupby("transport_mode")[["distance_km","reimbursement_eur"]].mean()
# pd.crosstab(df["location"], df["transport_mode"], normalize="index")[["car","public"]].sum(axis=1).sort_values(ascending=False)
# df.groupby("transport_mode")["reimbursement_eur"].sum().sort_values(ascending=False)
# df.groupby(["location", "transport_mode"])["reimbursement_eur"].sum()

#### 4.4.1. Visualization to summarize information
The heatmap below shows, for each city, the percentage of trips made with each transport mode.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

ct = pd.crosstab(df["location"], df["transport_mode"], normalize="index")*100
sns.heatmap(ct, annot=True, cmap="Blues")
plt.title("City × Transport Mode (% of Trips)")
plt.show()

#### 4.5. Case Example contd...

> The results indicate that there might be some special cases that should be thoroughly understood and used to form more inclusive reimbursement policies, ones that reduce the organization's costs without affecting employee well-being.

Filtering is essential for targeted questions. The following example reports the number of reimbursements applied per city for all the bike trips that were shorter than 5 km. 
`df[(df["transport_mode"] == "bike") & (df["distance_km"] < 5)]`
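The filter above combines two conditions. As a rough illustration, the sketch below builds each condition as a separate boolean mask on a hypothetical `toy` table (made-up values, column names as in the notebook) and then combines them with `&`.

```python
import pandas as pd

# Hypothetical toy data: five made-up trips.
toy = pd.DataFrame({
    "location": ["A", "A", "B", "B", "B"],
    "transport_mode": ["bike", "car", "bike", "bike", "car"],
    "distance_km": [3.0, 2.0, 4.5, 8.0, 12.0],
})

# Each comparison returns a boolean Series with one True/False per row.
is_bike = toy["transport_mode"] == "bike"
is_short = toy["distance_km"] < 5

# "&" combines the masks element-wise; only rows where both are True are kept.
short_bike = toy[is_bike & is_short]

# Count the remaining rows per city.
per_city = short_bike["location"].value_counts()

print(short_bike)
print(per_city)
```

When the conditions are written inline, each one needs its own parentheses because `&` binds more tightly than `==` and `<` in Python.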



In [None]:
short_trips = df[(df["transport_mode"] == "bike") & (df["distance_km"] < 5)]
city_counts = short_trips.location.value_counts()
city_counts

> Which cities have short trips (under 5 km) made by car?

In [None]:
short_trips = df[(df["transport_mode"] == "car") & (df["distance_km"] < 5)]
city_counts = short_trips.location.value_counts()
city_counts

> Are there any transport modes that are, on average, more expensive in a particular city?

In [None]:
df.groupby(["location", "transport_mode"])["reimbursement_eur"].mean()

> Are there any particular employees that make expensive reimbursement claims?

In [None]:
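# Claims above the 95th percentile count as "expensive"; list them, most expensive first
threshold = df["reimbursement_eur"].quantile(0.95)
expensive = df.loc[df["reimbursement_eur"] > threshold, ["employee_id", "reimbursement_eur"]].sort_values(by="reimbursement_eur", ascending=False)
expensive["employee_id"].value_counts()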

> Which employees make a lot of public transport trips?

In [None]:
public_usage = (
    df[df["transport_mode"] == "public"]
      .groupby(["employee_id", "location"])
      .size()
      .reset_index(name="public_count")
      .sort_values("public_count", ascending=False)
)

public_usage.head(20)

# Conclusion

Share your thoughts with the class:
- What was your favourite part of the session today?
- What can you do confidently?
- What would you like to practice next?
- How prepared do you feel about the advanced courses?
- Which problem-solving and logic-development tasks can you do comfortably?
- Which advanced courses would you like to take next?