In [1]:
import pandas as pd

### Reading in our data

In [2]:
RawData = pd.read_csv('vgchartz-2024.csv')

### Looking at our code from the last chapter

The code below processes the `RawData` DataFrame in five steps:

- Converts the 'release_date' column to `datetime`

- Drops two **columns**: 'img' and 'last_update'

- Drops rows with a missing 'release_date'

- Fills missing values with **zeros** in the critic score and sales columns

- Converts the 'console' column values to lowercase

```python
RawData['release_date'] = pd.to_datetime(RawData['release_date'])
RawData.drop(['img', 'last_update'], axis=1, inplace=True)
RawData = RawData.dropna(subset=['release_date'])
columns_to_fill_zero = ['critic_score', 'total_sales', 'na_sales', 'jp_sales', 'pal_sales', 'other_sales']
RawData.loc[:, columns_to_fill_zero] = RawData.loc[:, columns_to_fill_zero].fillna(0)
RawData['console'] = RawData['console'].str.lower()
```

---

To ensure consistency and efficiency across multiple files, we will encapsulate these lines of code into a function. This approach allows us to easily reuse the code, maintain it in one place, and simplify our workflow.

**Pros:**

- Reusability: Encapsulating the code into a function allows you to reuse it across multiple files or sections of your project, reducing redundancy.

- Maintainability: Changes to the data processing logic need to be made in only one place, making the code easier to maintain and update.

- Readability: A function with a descriptive name can make your code more readable and self-explanatory, enhancing overall clarity.

- Modularity: Functions help in organizing code into modular components, which can improve the structure and organization of your codebase.

**Cons:**

- Overhead: Introducing functions adds a layer of abstraction, which might slightly increase complexity, especially for simpler tasks.

- Debugging: If the function contains errors, debugging might be more challenging since you need to check both the function and its usage in different contexts.

- Performance: Although generally negligible, function calls introduce a minor performance overhead compared to inline code execution.


In [3]:
def preprocess_data(df):
    """
    Preprocesses the given DataFrame by performing the following steps:
    - Converts 'release_date' to datetime.
    - Drops the 'img' and 'last_update' columns.
    - Drops rows with missing 'release_date'.
    - Fills missing values with zero for the score and sales columns.
    - Converts 'console' column values to lowercase.

    Parameters:
    df (pd.DataFrame): The DataFrame to preprocess.

    Returns:
    pd.DataFrame: The preprocessed DataFrame.
    """
    df = df.copy()  # Create a copy to avoid modifying the original DataFrame
    df['release_date'] = pd.to_datetime(df['release_date'])
    df.drop(['img', 'last_update'], axis=1, inplace=True)
    df = df.dropna(subset=['release_date'])
    columns_to_fill_zero = ['critic_score', 'total_sales', 'na_sales', 'jp_sales', 'pal_sales', 'other_sales']
    df.loc[:, columns_to_fill_zero] = df.loc[:, columns_to_fill_zero].fillna(0)
    df['console'] = df['console'].str.lower()
    return df

In [4]:
preprocessed_RawData = preprocess_data(RawData)
preprocessed_RawData.head(1)

Unnamed: 0,title,console,genre,publisher,developer,critic_score,total_sales,na_sales,jp_sales,pal_sales,other_sales,release_date
0,Grand Theft Auto V,ps3,Action,Rockstar Games,Rockstar North,9.4,20.32,6.37,0.99,9.85,3.12,2013-09-17


#### How to call the function

```python
preprocessed_RawData = preprocess_data(RawData)

```

`preprocess_data(RawData)`: This calls the preprocess_data function and passes RawData as an argument.

`preprocessed_RawData`: The result of the function, which is a processed DataFrame, is assigned to this variable.

```python
preprocessed_RawData.head(1)

```

`preprocessed_RawData.head(1)`: This displays the first row of the preprocessed_RawData DataFrame.

**Summary**

`preprocess_data(RawData)`: Applies the preprocessing function to RawData, returning the modified DataFrame.

`preprocessed_RawData.head(1)`: Shows the first row of the resulting DataFrame after preprocessing.
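
To see what the `df.copy()` line buys us, here is a minimal sketch on made-up data. The three toy rows are assumptions for illustration, not rows from `vgchartz-2024.csv`, and the function body is repeated from above only so that the block runs on its own.

```python
import pandas as pd

def preprocess_data(df):
    """Same steps as the function above, repeated so this sketch is self-contained."""
    df = df.copy()
    df['release_date'] = pd.to_datetime(df['release_date'])
    df.drop(['img', 'last_update'], axis=1, inplace=True)
    df = df.dropna(subset=['release_date'])
    columns_to_fill_zero = ['critic_score', 'total_sales', 'na_sales',
                            'jp_sales', 'pal_sales', 'other_sales']
    df.loc[:, columns_to_fill_zero] = df.loc[:, columns_to_fill_zero].fillna(0)
    df['console'] = df['console'].str.lower()
    return df

# Toy data (hypothetical values): row 1 has no release date, row 2 has no critic score
toy = pd.DataFrame({
    'img': ['a.jpg', 'b.jpg', 'c.jpg'],
    'title': ['Game A', 'Game B', 'Game C'],
    'console': ['PS3', 'PC', 'NS'],
    'critic_score': [9.4, 8.0, None],
    'total_sales': [20.32, 1.0, None],
    'na_sales': [6.37, 0.5, None],
    'jp_sales': [0.99, 0.1, None],
    'pal_sales': [9.85, 0.3, None],
    'other_sales': [3.12, 0.1, None],
    'release_date': ['2013-09-17', None, '2020-03-20'],
    'last_update': [None, None, '2021-01-01'],
})

toy_clean = preprocess_data(toy)

# The returned DataFrame is cleaned; the input DataFrame is left as it was
print(toy_clean[['title', 'console', 'critic_score', 'release_date']])
print(toy[['title', 'console', 'critic_score', 'release_date']])
```

Because the function works on a copy, `toy` keeps its uppercase console names, its 'img' column, and all three rows. Only the returned DataFrame carries the cleaned values, so any later step that needs them has to use the returned variable.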

---

### Moving Functions to a New `.py` File

While it’s not always ideal to keep functions hidden away in separate files, there are situations where this practice is beneficial. For instance, if you're using a notebook primarily for data presentation, keeping your code organized can prevent the notebook from becoming cluttered with numerous lines of code. This makes the notebook cleaner and easier to follow for readers. Additionally, as we build upon previous notebooks in each chapter, maintaining functions in a separate file can help streamline the development process and improve code readability.


#### **Steps to Move Functions and Import Them**



1. **Create a New Python File**

    * **Create a new file**: Save the new file with a `.py` extension, for example, `data_preprocessing.py`.

    * **Add functions**: Copy your function definitions into this file. For example:

```python
# data_preprocessing.py

import pandas as pd


def preprocess_data(df):
    """
    Preprocesses the given DataFrame by performing the following steps:
    - Converts 'release_date' to datetime.
    - Drops the 'img' and 'last_update' columns.
    - Drops rows with missing 'release_date'.
    - Fills missing values with zero for the score and sales columns.
    - Converts 'console' column values to lowercase.

    Parameters:
    df (pd.DataFrame): The DataFrame to preprocess.

    Returns:
    pd.DataFrame: The preprocessed DataFrame.
    """
    df = df.copy()  # Create a copy to avoid modifying the original DataFrame
    df['release_date'] = pd.to_datetime(df['release_date'])
    df.drop(['img', 'last_update'], axis=1, inplace=True)
    df = df.dropna(subset=['release_date'])
    columns_to_fill_zero = ['critic_score', 'total_sales', 'na_sales', 'jp_sales', 'pal_sales', 'other_sales']
    df.loc[:, columns_to_fill_zero] = df.loc[:, columns_to_fill_zero].fillna(0)
    df['console'] = df['console'].str.lower()
    return df
```


    
2. **Save the File**

    * **Save the file**: Ensure `data_preprocessing.py` is saved in the same directory as your notebook, or in a directory that's included in your Python path.



3. **Import Functions in Your Notebook or Script**

    * **Open your notebook or script**: Go to the notebook or script where you want to use the functions.



    * **Import the function**: Use the `import` statement to bring functions from your `.py` file into your current code. For example:

```python 
# Importing the preprocess_data function from the data_preprocessing module

from data_preprocessing import preprocess_data

# Use the function
preprocessed_RawData = preprocess_data(RawData)
preprocessed_RawData.head(1)

```



4. **Verify Imports**

    * **Check for errors**: Ensure that there are no import errors. If you encounter any issues, double-check the file path and ensure there are no syntax errors in your `.py` file.

    * **Restart the kernel**: In Jupyter notebooks, you may need to restart the kernel after editing the `.py` file, because a module that has already been imported is not re-read automatically.

By following these steps, you can keep your notebook or script focused on the analysis while keeping function definitions in a separate, manageable file.
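
As a lighter alternative to restarting the kernel, the standard library's `importlib.reload` can re-read a module that was edited after it was imported. The sketch below is only an illustration: it writes a throwaway module named `demo_helpers` (a hypothetical name, not part of this project) into a temporary folder so the example runs on its own. The `sys.path.insert` line also shows one way to make a folder outside the notebook's directory importable.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Create a throwaway module in a temporary folder (hypothetical example module)
tmp_dir = Path(tempfile.mkdtemp())
module_path = tmp_dir / "demo_helpers.py"
module_path.write_text("def greet():\n    return 'v1'\n")

# Make the folder importable, as you would for a folder outside the notebook's directory
sys.path.insert(0, str(tmp_dir))

import demo_helpers
first = demo_helpers.greet()

# Simulate editing the file after it has been imported
module_path.write_text("def greet():\n    return 'v2 edited'\n")

# Without this call, demo_helpers.greet() would still return the old value
importlib.reload(demo_helpers)
second = demo_helpers.greet()

print(first, "->", second)
```

Note that names imported with `from module import name` keep pointing at the old object after a reload, so with that import style you would need to repeat the `from ... import ...` line, or restart the kernel as described above.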

In [5]:
preprocessed_RawData['console'].value_counts()

console
pc      10477
ps2      3511
ds       3166
ps       2694
ps4      2102
        ...  
fmt         3
cd32        2
aco         1
bbcm        1
c128        1
Name: count, Length: 79, dtype: int64

Problem: The sales data lists each console, but not the console's manufacturer. To solve this, we will map each console to a manufacturer using a dictionary.

Mapping consoles to a new manufacturer column:
- Make a dictionary for the mapping
- Flatten it into a single list
- Check for missing items
- Create conditions and values for `np.select`
- Assign console manufacturers
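
For comparison, the same kind of lookup can also be written by inverting the dictionary and calling `Series.map`. This is an alternative sketch, not the approach used in this chapter, and the two-manufacturer dictionary and three console names below are toy values chosen for illustration.

```python
import pandas as pd

# Toy version of the manufacturer dictionary (a small subset, for illustration only)
toy_categories = {
    'nintendo': ['wii', 'ds'],
    'sony': ['ps2', 'ps3'],
}

# Invert it: one entry per console, pointing at its manufacturer
console_to_mfg = {
    console: mfg
    for mfg, consoles in toy_categories.items()
    for console in consoles
}

toy_consoles = pd.Series(['wii', 'ps3', 'ouya'])

# Consoles that are not in the dictionary map to NaN, which we replace with 'unknown'
toy_mfg = toy_consoles.map(console_to_mfg).fillna('unknown')
print(toy_mfg.tolist())
```

Both approaches depend on an exact match between the console names in the data and the names in the dictionary, including letter case.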


In [6]:
import numpy as np

# Step 1: Make the dictionary for mapping
categories = {
    'nintendo': ['3ds', 'dsiw', 'dsi', 'ds', 'wii', 'wiiu', 'ns', 'gb', 'gba', 'nes', 'snes', 'gbc', 'n64', 'vb', 'gc', 'vc','ww'],
    'pc': ['linux', 'osx', 'pc', 'arc', 'all', 'fmt', 'c128', 'aco'],
    'xbox': ['x360', 'xone', 'series', 'xbl', 'xb', 'xs'],
    'sony': ['ps', 'ps2', 'ps3', 'ps4', 'ps5', 'psp', 'psv', 'psn', 'cdi'],
    'mobile': ['ios', 'and', 'winp', 'ngage', 'mob'],
    'sega': ['gg', 'msd', 'ms', 'gen', 'scd', 'sat', 's32x', 'dc'],
    'atari': ['2600', '7800', '5200', 'aj', 'int'],
    'commodore': ['amig', 'c64', 'cd32'],
    'other': ['ouya', 'or', 'acpc', 'ast', 'apii', 'pce', 'zxs', 'lynx', 'ng', 'zxs', '3do', 'pcfx', 'ws', 'brw', 'cv', 'giz', 'msx', 'tg16', 'bbcm']
}

# Step 2: Flatten categories into a single list
all_items = []
for sublist in categories.values():
    for item in sublist:
        all_items.append(item)

# Step 3: Check for dictionary items that do not appear in the data
# (both sides are lowercased and stripped for this comparison only)
all_items_lower = [item.lower().strip() for item in all_items]
unique_values_lower = set(RawData['console'].str.lower().str.strip().unique())
missing_items = set(all_items_lower) - unique_values_lower

if missing_items:
    print(f"Missing items: {missing_items}")
else:
    print("All items are covered.")

# Step 4: Create conditions and values for np.select
# Note: isin() is case-sensitive and RawData['console'] still holds the original casing
conditions = []
for items in categories.values():
    conditions.append(RawData['console'].isin(items))

values = list(categories.keys())

# Step 5: Assign console manufacturers
RawData['console_mfg'] = np.select(conditions, values, default='unknown')

All items are covered.

In [None]:
RawData['console_mfg'].value_counts()

console_mfg
unknown    63383
atari        633
Name: count, dtype: int64

In [None]:
missing_consoles = RawData[RawData['console_mfg'] == "unknown"]['console']
print("Consoles with unknown manufacturers:")
missing_consoles.value_counts()

Consoles with unknown manufacturers:


console
PC      12617
PS2      3565
DS       3288
PS4      2878
PS       2707
        ...  
TG16        3
FDS         1
C128        1
Aco         1
BBCM        1
Name: count, Length: 78, dtype: int64

In [None]:
ConsoleToQuery = 'bbcm'
QueryResult = RawData[RawData['console'] == ConsoleToQuery]
print(f"Rows where console = '{ConsoleToQuery}':")
QueryResult.head(20)

Rows where console = 'bbcm':


Unnamed: 0,img,title,console,genre,publisher,developer,critic_score,total_sales,na_sales,jp_sales,pal_sales,other_sales,release_date,last_update,console_mfg


In [None]:
GameToQuery = 'The Great Giana Sisters'
GameQueryResult = RawData[RawData['title'] == GameToQuery]
GameQueryResult.head(20)

Unnamed: 0,img,title,console,genre,publisher,developer,critic_score,total_sales,na_sales,jp_sales,pal_sales,other_sales,release_date,last_update,console_mfg
38407,/games/boxart/default.jpg,The Great Giana Sisters,MSX,Misc,Rainbow Arts,Unknown,,,,,,,1987-01-01,,unknown
38408,/games/boxart/default.jpg,The Great Giana Sisters,AST,Misc,Rainbow Arts,Unknown,,,,,,,1987-01-01,,unknown
38409,/games/boxart/default.jpg,The Great Giana Sisters,Amig,Misc,Rainbow Arts,Unknown,,,,,,,1987-01-01,,unknown
38410,/games/boxart/default.jpg,The Great Giana Sisters,C64,Misc,Rainbow Arts,Unknown,,,,,,,1987-01-01,,unknown
38411,/games/boxart/default.jpg,The Great Giana Sisters,BRW,Misc,Rainbow Arts,Unknown,,,,,,,1987-01-01,,unknown
41938,/games/boxart/full_1949814AmericaFrontccc.png,The Great Giana Sisters,ACPC,Platform,Rainbow Arts,Time Warp Productions,,,,,,,1987-01-01,2018-01-06,unknown
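
The uppercase values in these outputs (`PC`, `BBCM`, `MSX`) suggest why almost every row ended up as 'unknown': `preprocess_data` returns a lowercased copy, but the mapping above was run against `RawData`, whose 'console' values keep their original casing, and `isin` only matches exact strings. The sketch below reproduces the effect on four toy console names with a reduced dictionary; both are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy console names in their original casing (illustrative values)
toy_consoles = pd.Series(['PS3', 'PC', '2600', 'BBCM'])

# Reduced, lowercase dictionary in the same shape as `categories`
toy_categories = {
    'sony': ['ps3'],
    'pc': ['pc'],
    'atari': ['2600'],
    'other': ['bbcm'],
}
values = list(toy_categories.keys())

# Matching against the raw values: only the all-digit name can match
conditions_raw = [toy_consoles.isin(items) for items in toy_categories.values()]
mfg_raw = np.select(conditions_raw, values, default='unknown')

# Matching against lowercased values: every name finds its manufacturer
lowered = toy_consoles.str.lower()
conditions_lower = [lowered.isin(items) for items in toy_categories.values()]
mfg_lower = np.select(conditions_lower, values, default='unknown')

print(list(mfg_raw))
print(list(mfg_lower))
```

This matches the pattern in the real output, where only 'atari' (which includes purely numeric names such as '2600') was assigned. Running the mapping on `preprocessed_RawData`, or lowercasing the column before calling `isin`, would be one way to address it.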


In [None]:
categoriesList = []
for manufacturer, consoles in categories.items():
    for console in consoles:
        categoriesList.append({'manufacturer': manufacturer, 'console': console})

# Converting the list to a DataFrame
mfg_list = pd.DataFrame(categoriesList)

# Grouping by 'manufacturer' and aggregating consoles into lists
grouped_series = mfg_list.groupby('manufacturer')['console'].apply(list)

# Converting the grouped Series to Markdown
markdown_table = MarkDownSeries(grouped_series)

print(markdown_table)

| manufacturer | console |
|---|---|
| atari | 2600, 7800, 5200, aj, int |
| commodore | amig, c64, cd32 |
| mobile | ios, and, winp, ngage, mob |
| nintendo | 3ds, dsiw, dsi, ds, wii, wiiu, ns, gb, gba, nes, snes, gbc, n64, vb, gc, vc, ww |
| other | ouya, or, acpc, ast, apii, pce, zxs, lynx, ng, zxs, 3do, pcfx, ws, brw, cv, giz, msx, tg16, bbcm |
| pc | linux, osx, pc, arc, all, fmt, c128, aco |
| sega | gg, msd, ms, gen, scd, sat, s32x, dc |
| sony | ps, ps2, ps3, ps4, ps5, psp, psv, psn, cdi |
| xbox | x360, xone, series, xbl, xb, xs |
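
The `MarkDownSeries` helper is not defined in this section. As a rough idea of what such a helper could look like, here is a hypothetical stand-in named `series_to_markdown` (the name and implementation are assumptions, not the original helper) that turns a Series of lists into a table in the same format as the output above.

```python
import pandas as pd

def series_to_markdown(series):
    """Hypothetical stand-in: render a Series of lists as a Markdown table."""
    index_header = series.index.name or 'index'
    value_header = series.name or 'value'
    lines = [f"| {index_header} | {value_header} |", "|---|---|"]
    for key, items in series.items():
        # Join each list into a comma-separated cell
        lines.append(f"| {key} | {', '.join(str(item) for item in items)} |")
    return "\n".join(lines)

# Toy input built the same way as grouped_series, with made-up rows
toy_rows = pd.DataFrame({
    'manufacturer': ['atari', 'atari', 'sega', 'sega'],
    'console': ['2600', '7800', 'gg', 'dc'],
})
toy_grouped = toy_rows.groupby('manufacturer')['console'].apply(list)

toy_table = series_to_markdown(toy_grouped)
print(toy_table)
```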



