Lab 22 Worksheet.

Create a DataFrame named `students` with the following information:

- Name: `['Alice', 'Bob', 'Charlie', 'Dana', 'Eve']`
- ID: `[101, 102, 103, 104, 105]`
- Math Score: `[85, 90, 78, 92, 88]`
- Science Score: `[88, 76, 85, 95, 89]`
- English Score: `[92, 85, 89, 94, 90]`

1. Create the DataFrame and print it.
2. Display the column names of the DataFrame.

In [None]:
import pandas as pd

# Step 1: Create the DataFrame
students = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Dana', 'Eve'],
    'ID': [101, 102, 103, 104, 105],
    'Math Score': [85, 90, 78, 92, 88],
    'Science Score': [88, 76, 85, 95, 89],
    'English Score': [92, 85, 89, 94, 90]
})

# Step 2: Print the DataFrame
print("Students DataFrame:")
print(students)

# Step 3: Display the column names
print("\nColumn Names:")
print(students.columns)

Students DataFrame:
      Name   ID  Math Score  Science Score  English Score
0    Alice  101          85             88             92
1      Bob  102          90             76             85
2  Charlie  103          78             85             89
3     Dana  104          92             95             94
4      Eve  105          88             89             90

Column Names:
Index(['Name', 'ID', 'Math Score', 'Science Score', 'English Score'], dtype='object')


Saving and loading DataFrames as CSV (comma-separated values) files is an essential skill. The `.to_csv()` method saves a DataFrame to the Colab temporary workspace (or to your hard disk if you are working locally on your computer).

1. Save the DataFrame as a CSV file named `students_scores.csv`.
2. Ensure the file includes the column names and an index (using the `index=True` optional argument).
3. Print a confirmation message after saving.

In [None]:
# 2
# Step 2: Save the DataFrame as a CSV file
students.to_csv('students_scores.csv', index=True)

# Step 3: Print confirmation message
print("DataFrame saved as 'students_scores.csv'")


DataFrame saved as 'students_scores.csv'
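
To see what the `index=True` argument changes, here is a minimal sketch. It assumes a small made-up DataFrame named `demo` (not part of the worksheet) and relies on the fact that `.to_csv()` returns the CSV text as a string when no file path is given.

```python
import pandas as pd

# Hypothetical two-row frame, used only for this illustration
demo = pd.DataFrame({'Name': ['Alice', 'Bob'], 'ID': [101, 102]})

# With no path argument, to_csv() returns the CSV text instead of writing a file
with_index = demo.to_csv(index=True)
without_index = demo.to_csv(index=False)

print("index=True:")
print(with_index)    # header starts with an empty field for the index column
print("index=False:")
print(without_index) # header contains only the column names
```

With `index=True` each row starts with its index label, which is why the index needs special handling when the file is read back.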


The `pd.read_csv()` function reads a CSV file into a DataFrame. Using the `students_scores.csv` file, perform the following tasks:

1. Read the CSV file into a new DataFrame named `students_from_csv`.
2. Ensure that the index from the CSV file is properly handled during the import (using the `index_col=0` optional argument).
3. Print the newly loaded DataFrame to verify its contents.

In [None]:
# 3
# Step 1: Read the CSV file into a new DataFrame
students_from_csv = pd.read_csv('students_scores.csv', index_col=0)

# Step 2: Print the newly loaded DataFrame
print("DataFrame Loaded from CSV:")
print(students_from_csv)


DataFrame Loaded from CSV:
      Name   ID  Math Score  Science Score  English Score
0    Alice  101          85             88             92
1      Bob  102          90             76             85
2  Charlie  103          78             85             89
3     Dana  104          92             95             94
4      Eve  105          88             89             90
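
The effect of `index_col=0` is easiest to see by comparing a read with and without it. The sketch below is an illustration under assumptions: it reads from an in-memory string (`csv_text`, a made-up stand-in for a file saved with an index) rather than from `students_scores.csv`.

```python
import io
import pandas as pd

# Made-up CSV text shaped like a file saved with index=True
csv_text = ",Name,ID\n0,Alice,101\n1,Bob,102\n"

# Without index_col, the saved index comes back as an extra data column
no_index_col = pd.read_csv(io.StringIO(csv_text))

# With index_col=0, the first column is used as the row index
with_index_col = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(no_index_col.columns.tolist())    # ['Unnamed: 0', 'Name', 'ID']
print(with_index_col.columns.tolist())  # ['Name', 'ID']
```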


The `.rename()` method in Pandas is used to change the labels of a DataFrame's index or columns. It allows you to rename specific rows or columns without altering the data itself. Using `.rename()`, perform the following tasks:

1. Rename the column `'Math Score'` to `'Mathematics'`.
2. Rename the column `'Science Score'` to `'Science'`.
3. Rename the column `'English Score'` to `'English Language'`.
4. Print the updated DataFrame.

In [None]:
# 3
# Step 2: Rename the columns
students = students.rename(columns={
    'Math Score': 'Mathematics',
    'Science Score': 'Science',
    'English Score': 'English Language'
})

# Step 3: Print the updated DataFrame
print("Updated DataFrame with Renamed Columns:")
print(students)


Updated DataFrame with Renamed Columns:
      Name   ID  Mathematics  Science  English Language
0    Alice  101           85       88                92
1      Bob  102           90       76                85
2  Charlie  103           78       85                89
3     Dana  104           92       95                94
4      Eve  105           88       89                90
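
Two details of `.rename()` are worth a quick sketch: it can relabel rows as well as columns, and it returns a new DataFrame rather than changing the original. The `demo` frame and the row labels `'first'` and `'second'` are made up for this illustration.

```python
import pandas as pd

# Hypothetical frame, used only for this illustration
demo = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Math Score': [85, 90]})

# Rename one column and both row labels in a single call
renamed = demo.rename(
    columns={'Math Score': 'Mathematics'},
    index={0: 'first', 1: 'second'},
)

print(demo.columns.tolist())     # the original frame keeps its old labels
print(renamed.columns.tolist())  # the returned frame has the new column name
print(renamed.index.tolist())    # and the new row labels
```

This is why the worksheet solution assigns the result back with `students = students.rename(...)`.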


The `.dtypes` attribute displays the data type of each column in a Pandas DataFrame or Series. The `.astype()` method converts the data type of a DataFrame column or Series to a specified type.

Using the students DataFrame from above (whose columns were renamed with `.rename()`), perform the following tasks using `.dtypes` and `.astype()`:

1. Examine and print the data types of all columns in the renamed DataFrame.
2. Check the data type of the `'Mathematics'` column and print whether it is numerical (`int64` or `float64`).
3. Convert the `'ID'` column to a string data type and print the updated data types of all columns.


In [None]:
# 4
# Step 2: Examine and print data types of all columns
print("Data Types of All Columns:")
print(students.dtypes)

# Step 3: Check if 'Mathematics' is numerical
math_dtype = students['Mathematics'].dtype
if math_dtype in ['int64', 'float64']:
    print("\nThe 'Mathematics' column is numerical.")
else:
    print("\nThe 'Mathematics' column is not numerical.")

# Step 4: Convert 'ID' column to string and print updated data types
students['ID'] = students['ID'].astype(str)
print("\nUpdated Data Types After Converting 'ID' to String:")
print(students.dtypes)


Data Types of All Columns:
Name                object
ID                   int64
Mathematics          int64
Science              int64
English Language     int64
dtype: object

The 'Mathematics' column is numerical.

Updated Data Types After Converting 'ID' to String:
Name                object
ID                  object
Mathematics          int64
Science              int64
English Language     int64
dtype: object
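
As an alternative to comparing against the names `'int64'` and `'float64'`, pandas provides the helper `pandas.api.types.is_numeric_dtype`. The sketch below is not the worksheet's required method. It uses a made-up `demo` frame to show the check before and after an `.astype(str)` conversion.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical frame, used only for this illustration
demo = pd.DataFrame({
    'Name': ['Alice', 'Bob'],
    'ID': [101, 102],
    'Mathematics': [85, 90],
})

# Check every column before the conversion
numeric_before = {col: is_numeric_dtype(demo[col]) for col in demo.columns}

# Convert 'ID' to strings, then check again
demo['ID'] = demo['ID'].astype(str)
numeric_after = {col: is_numeric_dtype(demo[col]) for col in demo.columns}

print(numeric_before)
print(numeric_after)
```

After the conversion, `'ID'` is no longer reported as numeric, which matches the idea that an ID is a label rather than a quantity to do arithmetic on.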


Quite often when we build a DataFrame, some of the data is missing (for example, dropped quiz grades). `.isna()` and `.notna()` can help: `.isna()` detects missing (NaN) values in a DataFrame or Series, and `.notna()` detects non-missing (valid) values.

Using the students DataFrame below, which includes missing data (represented as NaN using `np.nan`):

```python
import numpy as np
import pandas as pd

students = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Dana', 'Eve'],
    'Math': [85, 90, np.nan, 92, 88],
    'Science': [88, np.nan, 85, 95, 89],
    'English': [92, 85, 89, 94, np.nan]
})
```

Perform the following tasks:

1. Use `.isna()` to check for missing values in the DataFrame and print the result.
2. Use `.notna()` to check for non-missing values in the DataFrame and print the result.
3. Calculate and print the total number of missing values in each column using `.isna().sum()`.
4. Calculate and print the total number of non-missing values in each column using `.notna().sum()`.

In [None]:
# 5
import numpy as np
import pandas as pd

# Step 1: Define the DataFrame with missing values
students = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Dana', 'Eve'],
    'Math': [85, 90, np.nan, 92, 88],
    'Science': [88, np.nan, 85, 95, 89],
    'English': [92, 85, 89, 94, np.nan]
})

# Step 2: Check for missing values using .isna()
print("Missing Values in DataFrame (True indicates missing):")
print(students.isna())

# Step 3: Check for non-missing values using .notna()
print("\nNon-Missing Values in DataFrame (True indicates non-missing):")
print(students.notna())

# Step 4: Calculate the total number of missing values in each column
print("\nTotal Number of Missing Values in Each Column:")
print(students.isna().sum())

# Step 5: Calculate the total number of non-missing values in each column
print("\nTotal Number of Non-Missing Values in Each Column:")
print(students.notna().sum())

Missing Values in DataFrame (True indicates missing):
    Name   Math  Science  English
0  False  False    False    False
1  False  False     True    False
2  False   True    False    False
3  False  False    False    False
4  False  False    False     True

Non-Missing Values in DataFrame (True indicates non-missing):
   Name   Math  Science  English
0  True   True     True     True
1  True   True    False     True
2  True  False     True     True
3  True   True     True     True
4  True   True     True    False

Total Number of Missing Values in Each Column:
Name       0
Math       1
Science    1
English    1
dtype: int64

Total Number of Non-Missing Values in Each Column:
Name       5
Math       4
Science    4
English    4
dtype: int64
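
The Boolean table returned by `.isna()` can also be used to answer follow-up questions. The sketch below uses a made-up three-row `demo` frame to pick out the rows that contain at least one missing value and to count the missing values across the whole DataFrame.

```python
import numpy as np
import pandas as pd

# Hypothetical frame, used only for this illustration
demo = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Math': [85, 90, np.nan],
    'Science': [88, np.nan, 85],
})

# any(axis=1) is True for a row if any of its columns is missing
rows_with_missing = demo[demo.isna().any(axis=1)]

# The first sum() counts per column, the second adds those counts together
total_missing = int(demo.isna().sum().sum())

print(rows_with_missing)
print("Total missing values:", total_missing)
```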


The `.fillna()` method in Pandas is used to replace missing values (NaN) in a DataFrame or Series with specified values. This is particularly useful for handling incomplete datasets where certain values are missing.

Using the students DataFrame from above, replace the missing values (NaN) in each column with the following default values using `.fillna()`:

- `'Math': 80`
- `'Science': 75`
- `'English': 70`

Print the updated DataFrame after replacing the missing values.

In [None]:
# 6
# Step 5: Replace missing values using .fillna()
students = students.fillna({
    'Math': 80,
    'Science': 75,
    'English': 70
})

# Step 6: Print the updated DataFrame
print("\nUpdated DataFrame After Replacing Missing Values:")
print(students)


Updated DataFrame After Replacing Missing Values:
      Name  Math  Science  English
0    Alice  85.0     88.0     92.0
1      Bob  90.0     75.0     85.0
2  Charlie  80.0     85.0     89.0
3     Dana  92.0     95.0     94.0
4      Eve  88.0     89.0     70.0
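
Fixed defaults are one choice. Another common one is to fill each gap with the mean of its own column. The sketch below swaps in that technique for comparison and is not what the task asks for. It rebuilds the data under the made-up name `demo` so that it runs on its own.

```python
import numpy as np
import pandas as pd

# Same values as the worksheet data, rebuilt so the sketch is self-contained
demo = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Dana', 'Eve'],
    'Math': [85, 90, np.nan, 92, 88],
    'Science': [88, np.nan, 85, 95, 89],
    'English': [92, 85, 89, 94, np.nan],
})

# mean() skips NaN by default; numeric_only=True leaves out the 'Name' column
column_means = demo.mean(numeric_only=True)

# fillna() matches the Series of means to the columns by label
filled = demo.fillna(column_means)

print(column_means)
print(filled)
```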


Using the students DataFrame from above (which should now have no missing values), perform the following tasks:

1. Add a new column `'Participation'` to the DataFrame with values `[10, 9, 8, 10, 7]`.
2. Add another new column `'Total Score'` that contains the sum of scores from `'Math'`, `'Science'`, `'English'`, and `'Participation'`.
3. Print the updated DataFrame with the new columns.

In [None]:
# 7
# Step 2: Add a new column 'Participation'
students['Participation'] = [10, 9, 8, 10, 7]

# Step 3: Add a new column 'Total Score'
students['Total Score'] = students['Math'] + students['Science'] + students['English'] + students['Participation']

# Step 4: Print the updated DataFrame
print("Updated DataFrame with New Columns:")
print(students)

Updated DataFrame with New Columns:
      Name  Math  Science  English  Participation  Total Score
0    Alice  85.0     88.0     92.0             10        275.0
1      Bob  90.0     75.0     85.0              9        259.0
2  Charlie  80.0     85.0     89.0              8        262.0
3     Dana  92.0     95.0     94.0             10        291.0
4      Eve  88.0     89.0     70.0              7        254.0
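
When many columns need to be added together, chaining `+` gets long. A possible alternative, sketched here on a made-up two-row `demo` frame, is to select the columns with a list and call `.sum(axis=1)` to add across each row.

```python
import pandas as pd

# Hypothetical frame holding the first two students' filled-in scores
demo = pd.DataFrame({
    'Name': ['Alice', 'Bob'],
    'Math': [85.0, 90.0],
    'Science': [88.0, 75.0],
    'English': [92.0, 85.0],
    'Participation': [10, 9],
})

score_columns = ['Math', 'Science', 'English', 'Participation']

# axis=1 sums across the columns of each row
demo['Total Score'] = demo[score_columns].sum(axis=1)

print(demo)
```

One difference to keep in mind: `.sum()` skips NaN values by default, whereas adding columns with `+` gives NaN for any row that has a missing score.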


Using the students DataFrame from above, perform the following tasks:

1. Use the `.iloc` indexer to retrieve and print all scores (excluding the name) for the third student (`Charlie`).
2. Calculate and print the standard deviation of the `'Math'` scores using `.std()`.
3. Use `.iloc` to calculate and print the standard deviation of all scores (excluding participation) for the first student (`Alice`).


In [None]:
# 8
# Step 2: Retrieve all scores for the third student ('Charlie') using iloc
charlie_scores = students.iloc[2, 1:]  # Exclude the name column
print("Charlie's Scores:")
print(charlie_scores)

# Step 3: Calculate the standard deviation of the 'Math' scores
math_std = students['Math'].std()
print("\nStandard Deviation of Math Scores:", math_std)

# Step 4: Calculate the standard deviation of all scores (excluding participation) for the first student ('Alice')
alice_scores = students.iloc[0, 1:4]  # Math, Science, English
alice_std = alice_scores.std()
print("\nStandard Deviation of Alice's Scores (Math, Science, English):", alice_std)


Charlie's Scores:
Math              80.0
Science           85.0
English           89.0
Participation        8
Total Score      262.0
Name: 2, dtype: object

Standard Deviation of Math Scores: 4.69041575982343

Standard Deviation of Alice's Scores (Math, Science, English): 3.511884584284246
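
The values above are sample standard deviations: pandas `.std()` divides by `n - 1` by default (`ddof=1`). The sketch below recomputes the Math result from a standalone Series and compares it with the population version (`ddof=0`). The variable names are made up for this illustration.

```python
import pandas as pd

# Math scores after the missing value was filled with 80
math_scores = pd.Series([85.0, 90.0, 80.0, 92.0, 88.0], name='Math')

# Default: sample standard deviation, dividing by n - 1
sample_std = math_scores.std()

# Population standard deviation, dividing by n
population_std = math_scores.std(ddof=0)

print("Sample std (ddof=1):    ", sample_std)
print("Population std (ddof=0):", population_std)
```

NumPy's `np.std` uses `ddof=0` by default, so the two libraries can give different answers for the same data unless `ddof` is set explicitly.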
