# Data Analysis in Python for Beginners

## The absolute basics
I've tried to make the instructions as detailed as possible for complete beginners. If you're already familiar with Python and Jupyter Notebooks, feel free to skim through this section or skip it entirely.









### Layout
This tutorial is created in a so-called Jupyter Notebook (file type .ipynb). It contains two types of "cells":
- Markdown cells (like this one) contain text and instructions.
- Code cells (see below) contain... well... code. You can execute them by clicking a small triangle or by clicking inside the cell and pressing SHIFT+ENTER or CTRL+ENTER.

What’s great about Jupyter Notebooks is that Markdown cells allow us to easily explain code, but more importantly, each code cell can be executed individually. This way, you can write and test your code piece by piece, seeing your output (or error message) immediately. This makes it easy to debug and learn.

**Try it out by running the code block below.**







In [None]:
print("Hello World!")

You will see the printed output below the code cell.

**Use the empty cell below to try out the `print` function yourself.**


### Defining Variables

Python, like most programming languages, relies on variables or "objects" for computations. Simply put, each variable has a *value* and a *type*. In Python, the type is derived implicitly from the value, which makes the language more beginner-friendly: you don't have to explicitly declare the data type of each object.

Below, we define two variables, `a` and `b`.

- `a` is an *integer* with the value 5
- `b` is a *string* (i.e., text) with the value "seven"

**Run the cell below.**



In [None]:
a = 5
b = "seven"
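If you are curious which type Python derived for each variable, the built-in `type()` function can tell you. A minimal sketch, reusing the same two variables:

```python
a = 5
b = "seven"

# type() reports the type Python derived from each value
print(type(a))  # <class 'int'>
print(type(b))  # <class 'str'>
```

`int` is Python's name for integers and `str` is its name for strings.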

### Retrieving Variables

We have now stored the values 5 and "seven" in the variables `a` and `b`. These stored variables can be retrieved by *calling* them.

**Run the cell below to call `a` and display its value on the screen.**

(Note: by default, Jupyter executes all the code in a code cell but only displays the value of the last line, unless you explicitly use `print()`.)


In [None]:
a
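As a small illustration of the note above, here is a sketch that uses `print()` to display more than one value from a single cell. The variables `x` and `y` are made up for this example:

```python
x = 1  # hypothetical example values
y = 2

# print() displays each value, no matter where the line sits in the cell
print(x)
print(y)
```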

Now we also want to call the variable `b`.

**Create a new code cell directly below this one and call the variable `b`.**

You can create new cells by clicking the plus button on the screen.


We can define new variables based on previously defined ones. For example, we can define `c = a + 10`.

In [None]:
c = a + 10
c

Well done! 🥳 You are now familiar with the basics of the Jupyter interface and know how to define variables.


### Dealing with Errors

Python may be a *forgiving* programming language, but it is still a programming language. This means if you don't follow its rules, it will return an error and not execute your code.

This can be tricky when you're starting out, as even a single typo can cause your code to fail. Fortunately, Python provides hints about why your code failed. So, if you encounter an error, try to follow these steps:

- Relax. Errors happen all the time, even to the best of us.
- Read the error message (especially the beginning and end of the message; they often provide helpful clues).
- Try to pinpoint the cause of the error:
  - Sometimes the error message tells you directly.
  - Sometimes you just need to look closely.
  - Sometimes it helps to break down a more complex sequence of steps to see exactly where your code is failing.
- If you can't fix the code yourself, Stack Overflow (stackoverflow.com) is your friend. Paste your error message into Google and see what advice you find on Stack Overflow. Everyone does this.
- ChatGPT (and other AI assistants) have also become an excellent debugging tool. You can paste your code, ask it to explain what each line does, or have it automatically correct your code.
  - However, while it's great to write code with ChatGPT, make sure you’re not just copying and pasting; try to actually understand what makes the code work or not.
- If all else fails, consider asking someone else for help. Sometimes a fresh pair of eyes is all you need.

**Now, check what’s wrong with the following code and fix it.**





In [None]:
print("This is a really important message. Unfortunately, it won't execute.)

Let’s try a slightly trickier example. Do you remember your variables `a` and `b` that you defined earlier?

**Call them again to remind yourself of their values.**

(Note: by default, Jupyter only displays the last variable in a cell. To see both, you can use two cells, or execute one variable first, then edit the cell and execute the other. Alternatively, use `print()`.)


**Now define the variable `d = a + b`.**

Why isn’t it working? What can you do to fix the problem?

### Using Code Comments

Sometimes it’s helpful to write small notes in your code to explain what’s happening. To do this, type "#" at the beginning of your line. These lines will not be executed as code.

In [None]:
# This line is just a comment and will not be executed
print("This line will be printed")

You can also use comments to *comment out* entire blocks of code.
This can be helpful for debugging as well.


In [None]:
#print("Line 1")
#print("Line 2")
print("Line 3")
#print("Line 4")

### Yay!

You’ve mastered the basic steps 🥳🥳🥳 Now that you’re settled in, let’s do something more interesting.

If you want to save your progress, save the file now and, if you like, download it locally. Use the UI to do this.


## Data Analysis in Python

The simplest and most common use case for data analysis in Python is *tabular data*. By this, I mean data that is in an *Excel* or *CSV* table format. These usually have one row per observation and one column per property or attribute of that observation.

For example, a medical study might record each participant on their own row and document their height, weight, gender, medication effects, etc., in different columns.

An accounting firm could record each transaction on its own row and write the sender, receiver, and amount in three columns.

In our example, we will start with a dataset of movies and their ratings, which we obtained from IMDb. You can download it from [Kaggle](https://www.kaggle.com/datasets/harshitshankhdhar/imdb-dataset-of-top-1000-movies-and-tv-shows/data). Since an account is required, I have also provided the dataset for you.

### Finding Datasets
I have provided all necessary datasets for you in this course. If you are curious where to find more data for analysis projects beyond this course, I suggest the following sources:
- [Kaggle Datasets](https://www.kaggle.com/datasets): Many high quality datasets from entry level to advanced
- [GitHub Awesome Public Datasets](https://github.com/awesomedata/awesome-public-datasets): Many real-world topical datasets. Tends to be more advanced
- Official agencies such as the [Statistisches Bundesamt](https://www-genesis.destatis.de/datenbank/online) but also UN, World Bank, etc.  


## A Few Words About Pandas

Data analysis is so common in Python that there is a collection of functions specifically designed for this purpose. These collections are also called *Packages* or *Libraries*. There are many of these in Python, each for a different purpose. Even for data analysis, there are several different ones. In this beginner course, we will use the most popular of these packages called *Pandas*.

A *Package* must always be imported before we can use the functions it contains. We do this at the beginning of our code and give it an alias so that we don't have to write out `pandas` every time. Instead, we can just use the common abbreviation `pd`.


In [None]:
import pandas as pd

### Reading Data

Reading data is very simple. You use the corresponding function, `pd.read_excel(...)` or `pd.read_csv(...)`, and provide the path of the file you want to read. That’s it: your table is now loaded as a Python object. Tables that are read into Python using `pandas` are called *pandas DataFrames*, and the variable is often abbreviated as `df` for *dataframe*.

Make sure you've loaded the data file into the directory to the left. See demo video on GitHub if you're struggling with this. Make sure the file is unzipped before uploading it.

Once you run the cell below, you will see an overview of the first and last entries of the DataFrame.

In [None]:
# Make sure the file is in the same directory as this script
# You may need to upload the file by dragging it into the area on the left in the browser
# If necessary, adjust the file path (e.g., "C:/Users/my Name/PythonCourse/DataAnalysis/imdb_top_1000.csv")
pd.read_csv("imdb_top_1000.csv")
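If you want to see what `pd.read_csv` does without needing a file, the sketch below reads a tiny CSV from a text string instead. The three movies and their values are invented for illustration and are not the real IMDb file. `io.StringIO` comes from Python's standard library and makes the text behave like an opened file.

```python
import io
import pandas as pd

# Hypothetical mini dataset in CSV format (not the real IMDb file)
csv_text = """Series_Title,Released_Year,IMDB_Rating
Movie A,1994,9.3
Movie B,1972,9.2
Movie C,2008,9.0
"""

# io.StringIO lets pd.read_csv treat the text like an opened file
sample = pd.read_csv(io.StringIO(csv_text))
print(sample)
```

The first line of the CSV becomes the column names, and every following line becomes one row.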


As usual, we need to assign the value to a *variable* so that we can reuse the DataFrame later in the code.

**Reload the table in the code cell below. This time, assign it to the variable `df`.**


In [None]:
df = # Complete this code

If everything worked, you can now call the DataFrame again by running `df`.

In [None]:
df

## First Pandas Operations: Head, Tail, Shape & Columns

A `pandas.DataFrame` comes with many built-in *functions* and *attributes* that we can take advantage of.

If we want to see only the first or last rows of the DataFrame, we use `head()` or `tail()`. This is often helpful when getting a sense of a new dataset.


In [None]:
df.head()

By default, `head()` displays the first 5 rows. However, we can pass a number as an *argument* to specify how many of the first rows we want to see.

**Display the first 3 rows by writing the desired number in the parentheses of the function**


**Now find out who the stars are in the sixth-to-last movie in the dataset.**

Hint: These guys are better known for their musical talent 😊🎵

Often, at the start, we want to know how large the dataset is. We can do this using `.shape`. Keep in mind that `.shape` is an *attribute* of the DataFrame, not a method. Therefore, unlike `head()`, we don’t use parentheses here.


In [None]:
df.shape

**If you’re unsure how to interpret the numbers above, feel free to ask your neighbor or me for help 😊**


You can get the column names using `.columns` (again, no parentheses).

In [None]:
df.columns
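To recap this section, here is a sketch that applies all four tools to a small, made-up DataFrame. The name `toy` and its contents are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy data, not the IMDb dataset
toy = pd.DataFrame({
    "movie": ["A", "B", "C", "D", "E", "F"],
    "year": [1990, 1995, 2000, 2005, 2010, 2015],
})

print(toy.head(2))        # first 2 rows (method, so parentheses)
print(toy.tail(2))        # last 2 rows (method, so parentheses)
print(toy.shape)          # (6, 2): attribute, so no parentheses
print(list(toy.columns))  # ['movie', 'year']: attribute, so no parentheses
```

In `.shape`, the first number is the number of rows and the second is the number of columns.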

## Filtering

Filtering a dataset is one of the most common tasks when working with tabular data. Unfortunately, there are multiple ways to do this, which can make it a bit confusing for beginners. We’ll approach the different scenarios step by step.


### Filtering by Columns

This is the easy case. Maybe we want to view only one or a few columns of the dataset. The best way to do this is by using the column names (refer to the list of column names above).

In [None]:
df.Series_Title

In `pandas`, there are two common ways to filter columns. You can either use the method mentioned above, with `.your_column_name`, or use the following approach:

In [None]:
df["Series_Title"]

When reading a single column, both methods give the same result, so the choice comes down to personal preference. There are, however, three important differences:
- If a column has a space in its name, you must use the second method `["your column name"]` (or rename the column).
- If you want to select multiple columns, you must use the second method: `df[["column_1", "column_2"]]`
- If you want to overwrite or create a column, you must use the second method: `df["your_new_column"] = "Example"`
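The three differences listed above can be sketched on a small, made-up DataFrame. The name `toy` and its columns are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data; note the space in "release year"
toy = pd.DataFrame({
    "movie": ["A", "B", "C"],
    "release year": [1990, 2000, 2010],
})

# Dot notation works for simple column names
movies = toy.movie

# A column name containing a space needs bracket notation
years = toy["release year"]

# Selecting several columns takes a list inside the brackets
subset = toy[["movie", "release year"]]

# Creating (or overwriting) a column needs bracket notation too
toy["watched"] = False
print(toy)
```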

Since the first method (`.column_name`) is slightly quicker to write, I try to use it whenever possible. However, in online tutorials, you will often see both methods used.

**The difference between `.column_name` and `["column_name"]` is often confusing to beginners. Make sure to read and understand the above section carefully.**

If we want to select multiple columns, we do it like this:

In [None]:
df[["Series_Title", "Released_Year"]]

**Try filtering the DataFrame on various single columns as well as on multiple columns yourself to get familiar with the syntax.**

### Filtering by Rows

Filtering by rows is a bit more complicated. Let’s start with the simple case: we want to filter a specific range of rows.


In [None]:
# Gives us the first row in the data frame
df.iloc[0]

In [None]:
# Gives us rows 10 to 14
df.iloc[10:15]

In [None]:
# Gives us rows 26, 101 and 250
df.iloc[[26, 101, 250]]
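One detail worth knowing: `iloc` counts positions starting at 0, and in a range like `1:3` the end position is not included. A small sketch on made-up data (the name `toy` is an assumption):

```python
import pandas as pd

# Hypothetical toy data with five rows at positions 0 to 4
toy = pd.DataFrame({"movie": ["A", "B", "C", "D", "E"]})

single = toy.iloc[0]       # position 0 is the first row
block = toy.iloc[1:3]      # positions 1 and 2; the end position 3 is excluded
picked = toy.iloc[[0, 4]]  # a list selects exactly these positions

print(block)
print(picked)
```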

**Try filtering on different rows yourself.**

### Filtering with Conditions

It gets more interesting when we want to filter rows based on specific *conditions*. For example, we might want to display only movies with a certain meta-score. The syntax is a bit similar to *if-statements* you've already encountered in another tutorial: `df[MY_CONDITION]`



In [None]:
df[df.Meta_score > 95]

**Try filtering on `IMDB_Rating` to reduce the DataFrame to only those movies that have an `IMDB_Rating` above 8.5.**

You can combine this with the `.shape` attribute you learned earlier to return the number of rows after your filter is applied. Like so: `df[MY_CONDITION].shape[0]`
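It can help to look at what the condition itself produces: a column of `True`/`False` values, one per row. The sketch below uses made-up data, and the names `toy`, `mask`, and `score` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "movie": ["A", "B", "C", "D"],
    "score": [70, 96, 88, 99],
})

# The condition alone yields one True/False value per row
mask = toy["score"] > 95
print(mask)

# Putting the condition inside df[...] keeps only the True rows
filtered = toy[mask]
n_rows = filtered.shape[0]
print(n_rows)  # 2
```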

**Use this to return the number of movies with more than 1 million votes**

Alternatively, we can filter for movies with a specific rating. Here are the ratings explained by ChatGPT, in case you're interested:

- **nan**: Not a Number, likely missing or undefined data.
- **Unrated**: Not submitted for rating or not officially rated.
- **U**: Universal, suitable for all audiences.
- **G**: General audience, suitable for all ages.
- **Passed**: Approved by the Production Code Administration (historically).
- **Approved**: Approved by the Production Code Administration (historically).
- **GP**: General audience, parental guidance suggested (historically).
- **PG**: Parental guidance, some content may be unsuitable for children.
- **TV-PG**: Parental guidance, some content may be unsuitable for children.
- **TV-14**: Parents strongly cautioned, unsuitable for children under 14 years.
- **PG-13**: Parents strongly cautioned, some content may be unsuitable for children under 13 years.
- **UA**: Universal Adult, suitable for children over 12 with parental guidance.
- **U/A**: Universal Adult, suitable for children over 12 with parental guidance.
- **16**: Suitable for viewers over 16 years old.
- **R**: Restricted, viewers under 17 need to be accompanied by a parent or adult guardian.
- **TV-MA**: Mature audiences, unsuitable for children under 17.
- **A**: Adults only.

Now, we can filter the dataset based on a specific rating. For example, if we want to find movies rated "PG-13":

In [None]:
df[df.Certificate == "PG-13"]

**How many movies are recommended for children of any age without concerns?**

Tip: Refer to the list above to understand which conditions to filter on. You may need multiple lines of code to complete this challenge.

### Filtering Rows with Multiple Conditions

Often, we want to apply multiple conditions simultaneously. It’s important to distinguish whether we want condition A **AND** condition B, or if we want condition A **OR** condition B.

In the following examples, we will look at the conjunctions **AND**, **OR**, and **NOT**. These three logical operators are enough to create any complex filter.

In [None]:
# We want to return movies which were released after 2012 AND no later than 2015.
# We use parentheses () for each partial condition and combine conditions using the "&" operator
df[(df.Released_Year > 2012) & (df.Released_Year <= 2015)]

Looking back at the above example, we want to find movies which are suitable for children of any age. Instead of writing two separate filters and adding the rows up (as you may have done) we can write everything in a single filter using the **OR** conjunction. See the code below:

In [None]:
# You'll probably find the vertical line | for the *OR* conjunction on the bottom left of your keyboard
# next to the Y- (or Z) key. Hold ALT-GR (to the right of the space bar) plus that key to type |
df[(df.Certificate == "U") | (df.Certificate == "G")]
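When several **OR** conditions check the same column, pandas also offers the `isin` method as a shorter alternative. The sketch below compares both approaches on made-up data (the name `toy` and its rows are assumptions):

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "movie": ["A", "B", "C", "D"],
    "Certificate": ["U", "PG-13", "G", "R"],
})

# Two conditions joined with OR
with_or = toy[(toy.Certificate == "U") | (toy.Certificate == "G")]

# The same filter using isin() with a list of accepted values
with_isin = toy[toy.Certificate.isin(["U", "G"])]
print(with_isin)
```

Both versions keep the same rows; `isin` tends to be easier to read once the list of accepted values gets longer.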

Finally, I'll show you the **NOT** operator (written as `~`) along with a slightly more complex filter.

**Try to understand which condition I have defined here:**

In [None]:
df[~(df.Genre.str.contains("Drama")) & (df.Meta_score > 85) & (df.Released_Year >= 2015)]

This is a very important lesson.
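If a combined filter is hard to read, one option is to store each partial condition in its own variable first. This is a sketch on made-up data with a different condition than the one above; the names `toy`, `is_comedy`, and `high_score` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "movie": ["A", "B", "C", "D"],
    "Genre": ["Comedy, Drama", "Action", "Comedy", "Action, Thriller"],
    "score": [90, 60, 80, 95],
})

# Each partial condition gets a descriptive name
is_comedy = toy.Genre.str.contains("Comedy")
high_score = toy.score > 75

# ~ flips every True/False in is_comedy, so this reads:
# NOT a comedy AND a high score
result = toy[~is_comedy & high_score]
print(result)
```

Only movie D remains: A and C are comedies, and B's score is too low.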

**Try different filters yourself. Set a specific question as your goal and then try to build the appropriate filter.**

**Start with the following questions:**

- **Which movies from the 90s have an IMDB rating over 8.0 and are not Sci-Fi films?**
- **I’m looking for an action movie with an IMDB rating over 8.0 or a Meta_score over 85.**

Note: Some columns are still incorrectly formatted. If you want to work with `Runtime` or `Gross` (box office revenue), we’ll need to transform these columns first. This is a bonus chapter I have prepared below.


In [None]:
# Try out different filters here

In [None]:
# Try out different filters here

In [None]:
# Try out different filters here

## Optional: Converting Columns: Adjusting Data Types and Cleaning Data

First, we can display the data types of the columns. To do this, we use `.dtypes`.

Note: `object` means string in this case, `int64` is an integer, and `float64` is a float (a decimal number).


In [None]:
df.dtypes

Ideally, we want to encode the `Runtime` as an `int`. However, we first need to separate the number of minutes from the text. There are many ways to do this; one of them is simply replacing " min" with an empty string.

In [None]:
df.head(2)

In [None]:
# We perform two operations in one line (first str.replace(), then astype(int))
# The result is then assigned to the "Runtime" column, overwriting it with the new result
df["Runtime"] = df.Runtime.str.replace(" min", "").astype(int)

Now we see that only the number remains in `Runtime`.

In [None]:
df.head(3)

**Check that the data type is now an integer**

**Now, try to transform the `Gross` column (the box office revenue of the film) into a `float`.**

Note: You need to find a solution for the commas in the numbers.

## Playground
Well done! 🥳🎉

You have learned everything I wanted to teach in this lesson. Now, it's important to apply and practice what you've learned.

**Try applying the content of this lesson as you like.**

**Bonus task: Create a new column `Rating_difference`, which holds the difference between the `IMDB_Rating` and the `Meta_score`. Important: The IMDB Rating ranges from 0-10, but Meta_score ranges from 0-100. So, first bring the two values to the same scale.**

Note: This is a transfer task because so far we've only overwritten existing columns. If you're stuck, seek help from the internet or from ChatGPT / Bing Copilot / Google Gemini.

Use this column for some interesting analyses 😊