# <span style="color:darkblue;">[LDATS2350] - DATA MINING</span>

### <span style="color:darkred;">Python00 - First Steps in Python</span>

**Prof. Robin Van Oirbeek**  

---

**<span style="color:darkgreen;">Guillaume Deside</span>** (<span style="color:gray;">guillaume.deside@uclouvain.be</span>)  

# 1. Introduction to Python

**Python** is a high-level programming language created by **Guido van Rossum** (a Dutch programmer) and first released in **1991**. Over time, Python has evolved into one of the most popular languages for various applications, including **data mining** and **machine learning**.

---

## Why Python 3?

- **Two main versions**: Python 2 and Python 3.  
- Python 3.0 (released in **December 2008**) introduced significant **backward-incompatible changes**.  
- **Python 2 End of Life**: Officially reached on **January 1, 2020**, meaning no further updates or maintenance.  
- **Recommendation**: Always use a recent **Python 3** release for new projects.

---

## Interpreted Language

Unlike languages that require a separate compilation step before a program can run (C, C++, Java, etc.), **Python is interpreted**:
- Your code is executed directly by an **interpreter** (the reference implementation, CPython, compiles it to bytecode internally), with no separate step that produces a machine-code binary.
- **Faster development cycle**: write code → run immediately → see results.

---

## Open Source Philosophy

Python is **open source**, meaning:
- Anyone can view and contribute to its source code.
- This fosters a **large community** that actively develops libraries, bug fixes, and new features.

---

## Why Python for Data Mining?

1. **Rich Ecosystem of Libraries**:  
   Python offers libraries like **NumPy**, **pandas**, **scikit-learn**, **matplotlib**, and many more. These tools provide powerful functions for data manipulation, statistical analysis, and machine learning, making Python an ideal choice for **data mining** workflows.

2. **Easy to Learn & Use**:  
   - The language is renowned for its **readable syntax** and **straightforward** structure.
   - Quick to prototype: you can implement data mining algorithms with fewer lines of code, helping you focus on **analysis** rather than low-level details.

3. **Huge Community & Resources**:  
   - A large user community offers extensive **tutorials**, **online forums**, and **packages**.  
   - Many data mining and machine learning **frameworks** are Python-first or provide strong Python interfaces.

4. **Integration & Flexibility**:  
   - Python can integrate with databases, web services, and other tools, streamlining end-to-end data mining projects.  
   - It supports diverse programming paradigms, including **object-oriented**, **imperative**, and **functional**.

---

## Python's Object-Oriented Paradigm

Although you can program in a procedural style, Python supports **Object-Oriented Programming (OOP)**:
- OOP helps **reduce system complexity** via **reusable** code components.
- **Objects** combine data (attributes) and procedures (methods).
- These objects are instantiated from **classes**, which define the blueprint of the objects’ data and behavior.
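
As a minimal sketch of these ideas, the hypothetical `Dataset` class below (the class name, its attributes, and the sample values are illustrative assumptions, not part of the course material) bundles data and a method, and two objects are instantiated from it:

```python
class Dataset:
    """Illustrative class: bundles data (attributes) with behavior (methods)."""

    def __init__(self, name, values):
        # Attributes: the data stored on each object
        self.name = name
        self.values = list(values)

    def mean(self):
        # Method: a procedure that operates on the object's own data
        return sum(self.values) / len(self.values)


# Two objects instantiated from the same class (the blueprint)
heights = Dataset("heights_cm", [170, 180, 175])
weights = Dataset("weights_kg", [65, 80])

print(heights.name, heights.mean())  # heights_cm 175.0
print(weights.name, weights.mean())  # weights_kg 72.5
```

Both objects reuse the same `mean` method while each keeps its own data, which is the kind of reuse the list above refers to.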

By combining **object orientation**, an **intuitive syntax**, and a **vast data analysis ecosystem**, **Python** is a powerful and accessible language for **data mining** tasks.


## 1.1 Python IDE (Integrated Development Environment)

While you can execute Python scripts directly from the command line (e.g., running `python <program>.py`), this approach does not easily scale for **larger projects** or **iterative exploration**—both of which are common in **data mining**. An **IDE (Integrated Development Environment)** or a dedicated code editor is typically preferred to manage projects, navigate code, and debug more effectively.

### Jupyter Notebook

In this course, we will make extensive use of **Jupyter Notebook**:

- **Origin**: Jupyter was born from the IPython project in 2014.  
- **Structure**: It is a web application based on a **client-server** architecture, allowing you to create and manipulate documents known as *notebooks*.  
- **Notebook Features**:  
  - A notebook file is composed of *cells* which can be executed **independently** (`Ctrl+Enter` runs the current cell; `Shift+Enter` runs it and moves to the next cell).  
  - All cells share the **same memory** (namespace). This means if you declare or modify a variable in one cell, you can access it in another cell—provided you run the former before the latter.  

### Why Use Jupyter Notebooks in Data Mining?

1. **Interactive Data Exploration**:  
   Data mining workflows often involve quickly testing ideas, plotting intermediate results, and revising steps. Jupyter's cell-based execution allows you to:
   - Tweak data cleaning steps or model parameters on the fly.  
   - Immediately visualize results (e.g., with libraries like matplotlib or seaborn).  

2. **Reproducibility and Collaboration**:  
   - Notebooks combine **code, outputs, and narrative** (markdown text, equations, etc.) in a single document.  
   - This makes it easier to share your data mining process (data preparation, exploratory analysis, modeling, etc.) with others.  

3. **Ease of Debugging and Prototyping**:  
   - You can isolate segments of logic (data transformation, model fitting) in different cells.  
   - If something goes wrong in your data pipeline or model training, only re-run the relevant cell instead of the entire script.  

Given these benefits, Jupyter Notebook is particularly well-suited for **iterative, experimental** data mining tasks where **rapid feedback** and **in-code documentation** are essential. Go ahead and **try** executing the cells in the provided notebook to see how variables can be defined in one cell and re-used in another!

In [None]:
a = 5

In [None]:
print(a)

Notebooks have a sequential structure (cells are normally executed one after another). They can display the results of computations using rich media representations such as HTML, LaTeX, PNG, and SVG, which makes them very useful for communicating results.

However, for really big projects, an IDE that offers code completion, syntax highlighting, resource management, and debugging tools will be helpful. Visual Studio Code, PyCharm, and Spyder are some examples.

## 2. Packages

We strongly recommend installing **Anaconda** to set up your Python environment. Anaconda is available for the three main operating systems (Linux, Windows, and macOS). It comes with a powerful **package manager** called `conda`, which simplifies the download and installation of additional libraries. (You can also use `pip`, but mixing `pip` and `conda` within the same environment can sometimes lead to dependency conflicts.)

Once Anaconda is installed, open a terminal (or Anaconda Prompt on Windows) and install packages such as **scikit-learn**, **numpy**, **pandas**, and **matplotlib**. For example:

```bash
conda install scikit-learn numpy pandas matplotlib
```

*(You could also use `pip install <packages>`, but try to stay consistent with one approach.)*
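
After installation, a quick check from Python can confirm that the packages are importable. The sketch below is one possible way to do it using only the standard library; the helper name `check_packages` is an illustrative assumption. Note that scikit-learn is imported under the name `sklearn`.

```python
import importlib.util

# Import names to check (scikit-learn is imported as "sklearn")
packages = ["numpy", "pandas", "sklearn", "matplotlib"]


def check_packages(names):
    """Return a dict mapping each import name to True if Python can locate it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


status = check_packages(packages)
for name, available in status.items():
    print(f"{name}: {'installed' if available else 'missing'}")
```

A package reported as `missing` can then be installed with the `conda install` command shown above.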

---

## 3. Syntax

Python syntax **does away with curly braces** (like `{ }`) for code blocks. Instead, it relies on **indentation** to define blocks (e.g., within `if`-`else`, `for` loops, function definitions, etc.). The **colon** (`:`) indicates where a new indentation block starts.

Example:

```python
a = 5

if a == 5:
    print("a is equal to 5")
else:
    print("Sadly a is not a beautiful 5 anymore :'(")
```

Note how the block under `if a == 5:` is indented at least one level to indicate the code that belongs to that condition.


In [1]:
a = 5
if a == 5:
  print("a is equal to 5")
else:
  print("Sadly a is not a beautiful 5 anymore :'(")

a is equal to 5
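
Indentation also determines how blocks nest. The following sketch (with made-up `scores` values) puts an `if`-`else` inside a `for` loop; each additional indentation level places a line inside one more block:

```python
scores = [12, 7, 18]
labels = []

for score in scores:          # the colon opens the loop body
    if score >= 10:           # nested block: one level deeper
        label = "pass"
    else:
        label = "fail"
    labels.append(label)      # back at loop level: runs for every score

print(labels)                 # no indentation: runs once, after the loop
# ['pass', 'fail', 'pass']
```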


## 4. Data Types

In Python, **data types** define the kind of values a variable can hold. Two common examples are **numeric** types and **boolean** types.

---

### 4.1 Numeric

Python supports several numeric data types, including:
- **int** (integer)
- **float** (floating-point)
- **complex** (complex numbers with a real and imaginary part)

**Examples:**

In [2]:
a = 5
print("Value of a =", a, "| Type of a:", type(a))

b = 5.0
print("Value of b =", b, "| Type of b:", type(b))

c = 2 + 4j
print("Value of c =", c, "| Type of c:", type(c))

Value of a = 5 | Type of a: <class 'int'>
Value of b = 5.0 | Type of b: <class 'float'>
Value of c = (2+4j) | Type of c: <class 'complex'>
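
The numeric types interact in arithmetic and can be converted explicitly. A short sketch, with arbitrary example values:

```python
a = 5  # int
b = 2  # int

true_div = a / b     # true division always returns a float
floor_div = a // b   # floor division of two ints returns an int
mixed = a + 1.5      # mixing int and float gives a float
truncated = int(5.9) # int() truncates toward zero
as_float = float(a)  # explicit conversion from int to float

print(true_div, type(true_div))    # 2.5 <class 'float'>
print(floor_div, type(floor_div))  # 2 <class 'int'>
print(mixed, truncated, as_float)  # 6.5 5 5.0

c = 2 + 4j
print(c.real, c.imag)  # 2.0 4.0 (both parts are stored as floats)
```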


### 4.2 Boolean

A **Boolean** type can only have one of two values: `True` or `False`. These values are heavily used for **logical** operations and **conditional** checks in Python.

In [3]:
is_sunny = True
print("Value of is_sunny =", is_sunny, "| Type of is_sunny:", type(is_sunny))

is_raining = False
print("Value of is_raining =", is_raining, "| Type of is_raining:", type(is_raining))

Value of is_sunny = True | Type of is_sunny: <class 'bool'>
Value of is_raining = False | Type of is_raining: <class 'bool'>
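
Booleans usually arise from comparisons and are combined with the logical operators `and`, `or`, and `not`. A small sketch (the `temperature` variable and its threshold are illustrative assumptions):

```python
temperature = 22
is_sunny = True
is_raining = False

# A comparison evaluates to a bool
is_warm = temperature > 20

# Logical operators combine booleans
go_outside = is_sunny and not is_raining
need_umbrella = is_raining or not is_sunny

print(is_warm, go_outside, need_umbrella)  # True True False

# Booleans drive conditional checks
if go_outside and is_warm:
    print("Nice day for a walk")
```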


<div style="border: 2px solid darkblue; padding: 10px; background-color: #89D9F5;">

##### **Exercise**

1. Create three variables in Python:
   - One integer `my_int` (e.g., `my_int = 7`)
   - One float `my_float` (e.g., `my_float = 3.14`)
   - One boolean `my_bool` (e.g., `my_bool = True`)

2. Write a small script to ask the user to guess which data type each variable has.
   ```python
   guess = input("What type is 'my_float'? (int, float, complex, bool): ")
   if guess == "float":
       print("Correct!")
   else:
       print("Nope, it's a float.")
   ```

3. Print out the actual types of the three variables using `type()`, and confirm whether the user's guesses are correct.

4. Modify one of your variables to another data type (e.g., turn an integer into a float). What changes do you need to make in your code, if any, to handle this modification?

**Goal**: This exercise helps you practice:
- Declaring variables of different data types.
- Using `type()` to inspect Python’s type system.
- Writing simple conditional logic with booleans (i.e., comparing user input to the correct answer).

</div>

#### **Solution to the Data Types Exercise**

Below is an example solution that follows the four steps outlined in the exercise. You can copy this code into a Python file or Jupyter Notebook and run it. Feel free to customize variable names, prompts, and messages.



In [5]:
# Step 1: Create three variables of different data types
my_int = 7
my_float = 3.14
my_bool = True

# Step 2: Write a script to ask the user about each variable's type
# For demonstration, we prompt the user to guess the type of `my_float`.
guess_float = input("What type is 'my_float'? (int, float, complex, bool): ")

if guess_float.lower() == "float":
    print("Correct! my_float is indeed a float.")
else:
    print("Nope, it's a float.")

# Let's also ask the user about `my_int` and `my_bool`
guess_int = input("What type is 'my_int'? (int, float, complex, bool): ")
if guess_int.lower() == "int":
    print("Correct! my_int is indeed an int.")
else:
    print("Nope, it's an int.")

guess_bool = input("What type is 'my_bool'? (int, float, complex, bool): ")
if guess_bool.lower() == "bool":
    print("Correct! my_bool is indeed a boolean.")
else:
    print("Nope, it's a boolean (bool).")

# Step 3: Print out the actual types
print("\n--- Actual Types ---")
print("Type of my_int =", type(my_int))
print("Type of my_float =", type(my_float))
print("Type of my_bool =", type(my_bool))

# Step 4: Modify one variable to another type
# Let's change 'my_int' into a float by just assigning a float value:
my_int = 7.0
print("\nAfter modification, my_int =", my_int, "and its type is", type(my_int))

# Note:
# If we re-ran the same guessing logic, the correct answer for my_int would no longer be 'int' but 'float'.
# However, there's no need to change other parts of the code unless our logic specifically depends on 'my_int' being an integer.

Correct! my_float is indeed a float.
Correct! my_int is indeed an int.
Correct! my_bool is indeed a boolean.

--- Actual Types ---
Type of my_int = <class 'int'>
Type of my_float = <class 'float'>
Type of my_bool = <class 'bool'>

After modification, my_int = 7.0 and its type is <class 'float'>


### 4.3 Sequence Type
In Python, a **sequence** is an ordered collection of items, which may be of the same or of different data types. Sequences allow you to store multiple values in an organized, efficient manner. There are several built-in sequence types:

1. **String**: An immutable sequence of Unicode characters.
2. **List**: An array-like structure that can contain multiple data types (e.g., integers, strings, objects). Lists are **mutable** (their elements can be changed).
3. **Tuple**: Similar to a list, but **immutable** (their elements cannot be changed after creation).

These three types support **slicing** and **indexing** using `[ ]`. Keep in mind that Python uses **zero-based indexing**.
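
The same `[ ]` syntax works on all three types. A brief sketch with arbitrary example values:

```python
text = "LDATS2350"
items = [10, 20, 30, 40, 50]
point = (3, 4, 5)

print(text[0])     # 'L'  (index 0 is the first element)
print(text[-1])    # '0'  (negative indices count from the end)
print(text[0:5])   # 'LDATS'  (the stop index is excluded)
print(items[1:3])  # [20, 30]
print(items[::2])  # [10, 30, 50]  (every second element)
print(point[:2])   # (3, 4)  (slicing a tuple returns a tuple)
```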

---

#### **Strings**

In [None]:
# Creation of a string 
myString = "An example for LDATS2350 practical session"
print("An example of a String:", myString, "\n")

**Key Point**: Strings are immutable in Python; you can't change individual characters once the string is created.
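
A short sketch of what this means in practice: assigning to a character raises a `TypeError`, so the usual approach is to build a new string (the variable names here are illustrative):

```python
course = "LDATS2350"

try:
    course[0] = "X"  # item assignment is not supported on str
except TypeError as err:
    print("Cannot modify in place:", err)

# Build a new string instead; the original is left untouched
renamed = "X" + course[1:]
upper_title = "data mining".upper()  # string methods return new strings

print(course, renamed, upper_title)  # LDATS2350 XDATS2350 DATA MINING
```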

---

#### Lists

In [None]:
# Creation of a list 
myList = ["An", "example", "for", "LDATS2350", "practical", "session"]
print("An example of a List:", myList)

# Demonstrate mutability
myList[3] = "LDATS2340"
print("Lists are mutable:", myList, "\n")


**Key Point**: Lists can be modified in place (e.g., element reassignment, insertion, deletion).
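
Insertion and deletion can be sketched with the built-in list methods, using an example list of words:

```python
words = ["An", "example", "session"]

words.append("today")         # add at the end
words.insert(2, "practical")  # insert before index 2
words.remove("An")            # delete the first matching value
del words[-1]                 # delete by position (here: the last element)

print(words)  # ['example', 'practical', 'session']
```

All four operations modify `words` in place; none of them creates a new list.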

---

#### Tuples


In [None]:
# Creation of a tuple 
myTuple = ("An", "example", "for", "LELEC2900", "practical", "session")
print("An example of a Tuple:", myTuple)

# Demonstrate immutability
try:
    myTuple[3] = "LELEC2800"  # raises TypeError: tuples do not support item assignment
    print("This line is never reached:", myTuple, "\n")
except TypeError as e:
    print("Tuples are immutable. The system raised the following error:", e, "\n")


**Key Point**: Tuples are immutable; once created, their elements cannot be reassigned.

---

#### Sets

A **set** is an unordered collection of elements that is:
- **Iterable**
- **Mutable**
- **Free of duplicate elements**

Unlike lists or tuples, **sets** are not sequences: they do not preserve any particular order, and they do not support indexing or slicing with `[ ]`.




In [None]:
# Creation of a set 
mySet = {"An", 8, "example", "for", True, "LELEC2900", "practical", "session"}
print("An example of a Set:", mySet)

**Key Point**:  
- Sets are optimized for **membership testing** (e.g., `if "LELEC2900" in mySet:`).  
- Since sets are unordered, you cannot do `mySet[0]`.
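
The properties above can be sketched as follows, with an example list of course codes that contains duplicates:

```python
codes = ["LDATS2350", "LELEC2900", "LDATS2350", "LDATS2350"]

unique_codes = set(codes)             # duplicates collapse into one element
print(len(codes), len(unique_codes))  # 4 2

print("LELEC2900" in unique_codes)    # True (membership test)

unique_codes.add("LDATS2340")         # sets are mutable
unique_codes.discard("LELEC2900")     # remove an element if it is present

# Sort before printing, since a set has no guaranteed display order
print(sorted(unique_codes))           # ['LDATS2340', 'LDATS2350']
```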

---

#### **Dictionary**

A **dictionary** is a collection of **key-value pairs**. Since Python 3.7, dictionaries preserve insertion order, but values are accessed by key rather than by position. Dictionaries are optimized for quick **key-based** lookups.

In [None]:
# Creation of a dictionary
myDict = {
    1: "An",
    2: "example",
    3: "for",
    'key': "LELEC2900",
    'keybis': "practical",
    'keyter': "session"
}
print("An example of a Dictionary:", myDict)


**Key Point**:  
- Dictionaries allow you to retrieve values by referencing their **keys**, e.g. `myDict['key']` returns `"LELEC2900"`.  
- Keys can be of different types (integers, strings, etc.) as long as they are **hashable**.
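
A short sketch of key-based access and the hashability requirement (the `grid` dictionary and its contents are illustrative assumptions):

```python
courses = {"LDATS2350": "Data Mining"}

# .get() returns a default value instead of raising KeyError
found = courses.get("LDATS2350")
missing = courses.get("UNKNOWN", "not found")
print(found, "|", missing)  # Data Mining | not found

# Tuples of hashable items are hashable, so they can be keys
grid = {(0, 0): "origin", (1, 2): "point"}
print(grid[(1, 2)])  # point

# Lists are mutable and therefore not hashable
try:
    bad = {[0, 0]: "origin"}
except TypeError as err:
    print("Lists cannot be keys:", err)

# Iterate over key-value pairs
for code, name in courses.items():
    print(code, "->", name)
```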

---

<div style="border: 2px solid darkblue; padding: 10px; background-color: #89D9F5;">

##### **Exercise**

Create a script that manipulates these data structures:

1. **Strings**:  
   - Define a string called `course_title` (e.g., `"LDATS2350 - Data Mining"`) and print its length using `len(course_title)`.

2. **Lists**:  
   - Create a list of at least 5 elements (mix strings and integers).  
   - Replace the 3rd element of the list with something new.  
   - Print the modified list.

3. **Tuples**:  
   - Define a tuple with exactly 3 elements.  
   - Attempt to change one of those elements (handle or display any error message that arises).

4. **Sets**:  
   - Create a set that contains at least one string, one integer, and one boolean.  
   - Check if a certain element (e.g., a string) is in the set, and print the result.

5. **Dictionaries**:  
   - Create a dictionary that maps **course codes** to **course names** (e.g., `'LDATS2350': "Data Mining"`).  
   - Print the value associated with `'LDATS2350'`.  
   - Add a new key-value pair and print the entire dictionary again.

**Bonus**:  
- Write a short piece of code that asks the user for a **dictionary key** and prints out the corresponding value if it exists. If it doesn’t exist, print an appropriate message.

</div>


##### **Solution to the Exercise**

Below is an example solution to each step, showing how to create and manipulate the different data types in Python. You can copy this code into a script or a Jupyter Notebook cell.


In [6]:
# 1. STRINGS
course_title = "LDATS2350 - Data Mining"
print("Course title:", course_title)
print("Length of course_title:", len(course_title))

# 2. LISTS
my_list = ["Hello", 42, "Python", 3.14, "LDATS2350"]
print("\nOriginal list:", my_list)

# Replace the 3rd element (index 2) with something new
my_list[2] = "Changed"
print("List after modification:", my_list)

# 3. TUPLES
my_tuple = ("A", "B", "C")
print("\nOriginal tuple:", my_tuple)

# Attempt to change the second element (index 1)
try:
    my_tuple[1] = "Z"
except TypeError as e:
    print("Tuples are immutable! Error message:", e)

# 4. SETS
my_set = {"LDATS2350", 7, True, "Data"}
print("\nCreated set:", my_set)

# Check membership
element_to_check = "Data"
if element_to_check in my_set:
    print(f"'{element_to_check}' is in the set.")
else:
    print(f"'{element_to_check}' is NOT in the set.")

# 5. DICTIONARIES
course_dict = {
    "LDATS2350": "Data Mining",
    "LDATS1000": "Intro to Python"
}
print("\nDictionary of courses:", course_dict)

# Print value associated with 'LDATS2350'
print("Value associated with 'LDATS2350':", course_dict["LDATS2350"])

# Add a new key-value pair and print the dictionary
course_dict["LDATS3000"] = "Advanced Machine Learning"
print("Updated dictionary:", course_dict)

# BONUS
print("\n--- BONUS ---")
user_key = input("Enter a course code to look up: ")

if user_key in course_dict:
    print(f"Value for '{user_key}': {course_dict[user_key]}")
else:
    print(f"'{user_key}' does not exist in the dictionary.")

Course title: LDATS2350 - Data Mining
Length of course_title: 23

Original list: ['Hello', 42, 'Python', 3.14, 'LDATS2350']
List after modification: ['Hello', 42, 'Changed', 3.14, 'LDATS2350']

Original tuple: ('A', 'B', 'C')
Tuples are immutable! Error message: 'tuple' object does not support item assignment

Created set: {True, 'Data', 'LDATS2350', 7}
'Data' is in the set.

Dictionary of courses: {'LDATS2350': 'Data Mining', 'LDATS1000': 'Intro to Python'}
Value associated with 'LDATS2350': Data Mining
Updated dictionary: {'LDATS2350': 'Data Mining', 'LDATS1000': 'Intro to Python', 'LDATS3000': 'Advanced Machine Learning'}

--- BONUS ---
Value for 'LDATS2350': Data Mining
