<a href="https://colab.research.google.com/github/brooklynbowers1/pandas_practice_ztm/blob/main/Data_Cleaning_and_Preprocessing_with_Pandas_and_Numpy_Interactive_Practice.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img style="display: block; margin: 0 auto" src="https://images.squarespace-cdn.com/content/v1/645a878d9740963714b8f343/3efb24e3-9fb9-4bc7-b41e-7f36742ae747/2-2.jpg?format=1500w" alt="Lonely Octopus Logo">

**Please create a copy of the notebook in your Google Drive to be able to edit it.**

**You can make a copy from the menu: File > Save a copy in Drive**

# **Data Cleaning and Preprocessing - Interactive Practice**

# **Pandas - Basics**

## <ins>Introduction to the pandas Library<ins>
* The pandas DataFrame is a versatile two-dimensional data structure for organizing and manipulating tabular data.
* It serves as a powerful tool for data scientists and analysts, facilitating tasks such as data cleaning, preprocessing, manipulation, and analysis.
* DataFrames are created easily from various data sources, such as lists, dictionaries, NumPy arrays, and CSV files.
* Understanding DataFrame operations is crucial for efficient data analysis and exploration, with comprehensive guidance available in the official pandas documentation.
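
As a rough sketch of the point about data sources, the same small table can be built from a dictionary, a list of rows, or a NumPy array. The variable names and values below are made up for illustration and are not part of the course material.

```python
import numpy as np
import pandas as pd

# From a dictionary: the keys become the column names
df_from_dict = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# From a list of rows: the column names are passed separately
df_from_lists = pd.DataFrame([[1, 4], [2, 5], [3, 6]], columns=['A', 'B'])

# From a two-dimensional NumPy array
df_from_array = pd.DataFrame(np.array([[1, 4], [2, 5], [3, 6]]), columns=['A', 'B'])

# Check whether the first two DataFrames hold the same data
print(df_from_dict.equals(df_from_lists))
print(df_from_array)
```

Whichever source is used, the result is the same kind of object, so everything in the rest of this notebook applies to it.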





### <ins>Importance<ins>
* DataFrames play a vital role in data science by offering functionalities for organizing, cleaning, and analyzing data efficiently.
* They provide a structured format for handling heterogeneous data, enabling data-driven decision-making across various domains.
* By leveraging DataFrames, analysts can extract meaningful insights, identify patterns, and make informed decisions based on accurate and organized data.

### <ins> How to find your pandas version <ins>
* Example below:



In [None]:
import pandas as pd
print(pd.__version__)

1.5.3


### <ins> How to install pandas using pip <ins>
* Example below:

In [None]:
# Installing pandas using pip
!pip install pandas

## <ins> Introduction to pandas Series <ins>
Covers pandas Series, one-dimensional labeled arrays capable of holding data of any type. The pandas Series is a fundamental data structure for single-dimensional data analysis.


### <ins> How to create a pandas series <ins>
* Example below:

In [None]:
# Creating a pandas Series
series = pd.Series([1, 3, 5, 9, 6, 8])
series

0    1
1    3
2    5
3    9
4    6
5    8
dtype: int64


**Practice task:** <nbsp>
Create a new series with the following adjustments:
> Set the `series` to a random group of 8 numbers <br>
> print `series` to observe changes <br>

Compare and comment on the differences between the example and new output
    
Show the new code and include the comments below

In [None]:
# Practice Task here

## <ins> Working with Attributes in Python <ins>
Explains how to access and use attributes of pandas objects to retrieve information about the data. Attributes provide quick insights into data without needing to invoke a method.


### <ins> How to access the `dtype` attribute from pandas Series <ins>
* Example below:

In [None]:
# Accessing the 'dtype' attribute from pandas Series
series = pd.Series([1, 3, 5, 6, 6, 8])
series.dtype

int64


**Practice task:** <nbsp>
Create a new series with the following adjustments:
> Set the `series` to 6 random letters <br> **Hint**: Apply quotation marks around the letter, like this "A" <br>
> print `series` to observe changes <br>

Compare and comment on the differences between the example and new output
    
Show the new code and include the comments below

In [None]:
# Practice Task here

## <ins> Using an Index in pandas <ins>
Discusses the index, the immutable array that enables data alignment and quick data retrieval. Indexing is crucial for efficient data access and manipulation in pandas.
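
To illustrate what "immutable" means here, the sketch below tries to change a single label in place and then replaces the whole index instead. The labels are arbitrary examples, and the exact error message may differ between pandas versions.

```python
import pandas as pd

series = pd.Series([1, 3, 5], index=['a', 'b', 'c'])

# Changing a single label in place is not allowed
try:
    series.index[0] = 'z'
    changed_in_place = True
except TypeError as err:
    changed_in_place = False
    print("Cannot modify the index in place:", err)

# Replacing the whole index with a new one is allowed
series.index = ['x', 'y', 'z']
print(series)
```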


### <ins> How to set an index in a pandas Series <ins>
* Example below:

In [None]:
# Setting an index in pandas Series
series = pd.Series([1, 3, 5, 7, 9], index=['a', 'b', 'c', 'd', 'e'])
series

a    1
b    3
c    5
d    7
e    9
dtype: int64

**Practice task:** <nbsp>
Create a new series with the following adjustments:
> Set the `index` to 5 types of fruit <br> **Hint**: Apply quotation marks around each fruit name, like this 'Apples' <br>
> print `series` to observe changes <br>

Compare and comment on the differences between the example and new output
    
Show the new code and include the comments below

In [None]:
# Practice Task here

## <ins> Indexing Insight: Label vs Position-Based <ins>
Comparison between using labels `.loc[]` and integer positions `.iloc[]` for indexing. Knowing when to use label-based vs. position-based indexing can help in selecting data more intuitively.


### <ins> How to use label and position based indexing <ins>
* Example below:

In [None]:
# Setting an index in pandas Series
series = pd.Series([1, 3, 5, 7, 9], index=['a', 'b', 'c', 'd', 'e'])
# Label-based indexing
series.loc['a']
# Position-based indexing
series.iloc[0]

1
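
The difference between the two is easiest to see when the labels are integers that do not match the positions. The following is a small sketch with made-up values, not part of the original exercise:

```python
import pandas as pd

# Integer labels deliberately listed out of order
series = pd.Series([10, 20, 30], index=[2, 0, 1])

by_label = series.loc[0]      # the value whose label is 0
by_position = series.iloc[0]  # the value in the first position

print("loc[0]: ", by_label)
print("iloc[0]:", by_position)
```

Here `.loc[0]` and `.iloc[0]` return different values, because the label `0` sits in the second position.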

**Practice task:** <nbsp>
Print out your code with the following adjustments:
> Set the label-based `index` to 'b' <br>
> Set the position-based `index` to '2' <br>

Compare and comment on the differences between the example and new output
    
Show the new code and include the comments below

In [None]:
# Practice Task here

## <ins> Indexing Demystified: A Deep Dive <ins>
Explores advanced techniques and functionalities related to indices. Indices provide powerful ways to manipulate and work with data structures.


### <ins> How to reset the index of a DataFrame <ins>
* Example below:

In [None]:
# Resetting the index of a DataFrame
df = pd.DataFrame({'A': range(4), 'B': range(4, 8)})
df.reset_index()

   index  A  B
0      0  0  4
1      1  1  5
2      2  2  6
3      3  3  7
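
As a possible extension of this example, `set_index()` does the reverse of `reset_index()`, and the `drop=True` option discards the old index instead of keeping it as a column. The variable names `indexed`, `restored` and `dropped` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': range(4), 'B': range(4, 8)})

# Use column 'A' as the index instead of the default 0..3
indexed = df.set_index('A')

# reset_index() moves the index back into a regular column
restored = indexed.reset_index()

# drop=True discards the old index instead of keeping it as a column
dropped = indexed.reset_index(drop=True)

print(indexed)
print(restored)
print(dropped)
```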


**Practice task:** <nbsp>
Create a new DataFrame with the following adjustments:
> Set `A` to B<br>
> Set `B` to C <br>

Compare and comment on the differences between the example and new output <br>
**Extras**: Set `A` to 5 only without changing B and see what is the output<br>
Show the new code and include the comments below

In [None]:
# Practice Task here

## <ins> Method Mastery: Leveling Up in Python in Two Parts <ins>
Introduces methods, which are functions that belong to pandas objects. Methods allow for complex data operations and manipulations within pandas.


### <ins> Part I - How to preview a data series <ins>
* Example below:

In [None]:
# Using the 'head()' method to preview the first few rows of a Data Series
df = pd.Series({'A': 5, 'B': 2, 'C': 2, 'D': 2,'E': 2})
df.head()

A    5
B    2
C    2
D    2
E    2
dtype: int64

**Practice task:** <nbsp>
Make the following adjustments to the code:
> Set `A` to 1<br>
> Set `B` to 3<br>
> Set `C` to 8<br>
> Set `D` to 6<br>
> Set `E` to 5<br>

Compare and comment on the differences between the example and new output <br>

Show the new code and include the comments below

In [None]:
# Practice Task here

### <ins> Part II - How to find the sum of a data series <ins>
* Example below:

In [None]:
# Using the 'sum()' method to find the total number of people who participated in a sport
df = pd.Series({'Hockey': 2, 'Fishing': 8, 'Hiking': 5, 'Swimming': 5,'Athletics': 2})
df.sum()

22

**Practice task:** <nbsp>
Make the following adjustments to the code:
> Set `Swimming` to 0<br>
> Set `Fishing` to 6<br>


Compare and comment on the differences between the example and new output<br>

Show the new code and include the comments below

In [None]:
# Practice Task here

## <ins> How do you differentiate an argument from a parameter? <ins>
Clarifies the difference between parameters (variables in method definitions) and arguments (values passed to methods). Understanding this distinction is key to customizing method behaviors.


### <ins> How to define parameters and arguments <ins>
* Example below:

In [None]:
# Using the 'head()' method to preview the first few rows of a Data Series
df = pd.Series({'A': 5, 'B': 2, 'C': 2, 'D': 2,'E': 2})
df.head(2) # The option to set the number of rows is the parameter of the .head() method
# Our choice of that number is an argument


A    5
B    2
dtype: int64
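
The same distinction shows up when writing your own function. In the sketch below, `preview` is a hypothetical helper written for this illustration (it is not a pandas function): `series` and `n` are its parameters, and the values passed in each call are the arguments.

```python
import pandas as pd

def preview(series, n=5):
    """Return the first n items; `series` and `n` are the parameters."""
    return series.head(n)

data = pd.Series({'A': 5, 'B': 2, 'C': 2, 'D': 2, 'E': 2})

first_two = preview(data, 2)      # 2 is passed as a positional argument
first_three = preview(data, n=3)  # 3 is passed as a keyword argument
default_preview = preview(data)   # no argument given, so n falls back to 5

print(first_two)
print(first_three)
print(len(default_preview))
```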

**Practice task:** <nbsp>
Print your output with the following adjustments:
> Set the argument to 3,  **Hint**: `df.head(3)`<br>

Compare and comment on the differences between the example and new output <br>
**Extras**: Set the method to `.tail(2)` and see what is the output<br>
Show the new code and include the comments below

In [None]:
# Practice Task here

## <ins> The Pandas Documentation <ins>
An overview of how to navigate and utilize the pandas documentation for learning and troubleshooting. The documentation is an invaluable resource for understanding the functionalities and capabilities of pandas.


### <ins> Link to website <ins>
**Pandas Documentation**: https://pandas.pydata.org/docs/ <br>
**Pandas Documentation - `Series.sum`**: https://pandas.pydata.org/docs/reference/api/pandas.Series.sum.html

## <ins> DataFrame Dynamics: An Overview <ins>
Covers the DataFrame, a two-dimensional, size-mutable, potentially heterogeneous tabular data structure. DataFrames are central to pandas and data analysis tasks, supporting a wide variety of operations.


### <ins> How to create a DataFrame <ins>
* Example below:

In [None]:
# Creating a simple DataFrame
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df

   A  B
0  1  4
1  2  5
2  3  6


**Practice task:** <nbsp>
Create a new DataFrame with the following adjustments:
> Add a new column 'C' to the DataFrame with values [7,8,9] <br>

Compare and comment on the differences between the example and new output <br>

Show the new code and include the comments below

In [None]:
# Practice Task here

## <ins> Creating DataFrames from Scratch - Part I & Part II <ins>
Demonstrates various methods for creating DataFrames from different data sources. Flexibility in creating DataFrames is crucial for custom data analysis workflows.


### <ins> Part I - How to create a DataFrame from several lists using `zip` <ins>
* Example below:

In [None]:
# creating lists
l1 =["Ali", "Quentin", "Lionel", "Bruce", "Phoenix"]
l2 =["Zoro", "King", "Chaplin", "Queen", "Joker"]
l3 =[1, 43, 55, 30, 20]
l4 =[48, 12, 47, 71, 19]

# creating the DataFrame
team = pd.DataFrame(list(zip(l1, l2, l3, l4)))
team

         0        1   2   3
0      Ali     Zoro   1  48
1  Quentin     King  43  12
2   Lionel  Chaplin  55  47
3    Bruce    Queen  30  71
4  Phoenix    Joker  20  19


**Practice task:** <nbsp>
Print your output with the following adjustments:
> Assign column names to the DataFrame,  **Hint**: team.columns =['Name', 'Code', 'Age', 'Weight']<br>

Compare and comment on the differences between the example and new output <br>

Show the new code and include the comments below

In [None]:
# Practice Task here

### <ins> Part II - How to create a DataFrame from a list of lists <ins>
* Example below:

In [None]:
# Creating a DataFrame from a list of lists
data = [['work',3],['hobbies',2],['relationship',3]]
df = pd.DataFrame(data)
df

              0  1
0          work  3
1       hobbies  2
2  relationship  3


**Practice task:** <nbsp>
Print your output with the following adjustments:
> Assign column names to the DataFrame (one name per column),  **Hint**: df.columns =['Variable 1', 'Variable 2']<br>

Compare and comment on the differences between the example and new output <br>

Show the new code and include the comments below

In [None]:
# Practice Task here

# **Data Cleaning and Data Preprocessing - Pandas Series**

* Explains the processes involved in preparing raw data for analysis, including cleaning data (removing or correcting inaccuracies) and preprocessing data (transforming raw data into a usable format). <br>


## <ins> Data Distillation: Extracting Uniqueness <ins>
Demonstrates how to identify unique values and count them within a pandas Series. Useful for data exploration and understanding the diversity of data.


### <ins> How to find unique values in a series <ins>
* Example below:

In [None]:
series = pd.Series([1, 2, 2, 3, 4, 4, 4])
print(series.unique())
print(series.nunique())

[1 2 3 4]
4
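
If you also want to know how often each unique value occurs, `value_counts()` is one option. This is a small sketch on the same example data:

```python
import pandas as pd

series = pd.Series([1, 2, 2, 3, 4, 4, 4])

# value_counts() returns how many times each distinct value occurs
counts = series.value_counts()
print(counts)

# The number of distinct values matches nunique()
print(len(counts) == series.nunique())
```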


**Practice task:** <nbsp>
Print your output with the following adjustments:
> Change series to [5,5,5,7,8,1,2] and find the unique values and the number of unique values.

Compare and comment on the differences between the example and new output
    
Show the new code and include the comments below

In [None]:
# Practice Task here

## <ins> Numeric Navigation: Series to Arrays <ins>
Shows how to convert a pandas Series into a numpy array. Necessary for integrating pandas data structures with numpy operations or machine learning algorithms.


### <ins> How to convert a series into a NumPy array <ins>
* Example below:

In [None]:
import pandas as pd
import numpy as np
Price = pd.Series([200, 300, 450,150,890])
numpyarray = Price.to_numpy()
numpyarray

array([200, 300, 450, 150, 890])
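
Once the data is a NumPy array, NumPy functions can be applied to it directly. The sketch below reuses the example prices; the choice of `np.mean` and `np.max` is only for illustration.

```python
import numpy as np
import pandas as pd

Price = pd.Series([200, 300, 450, 150, 890])
numpyarray = Price.to_numpy()

print(type(numpyarray))     # a NumPy ndarray, no longer a Series
print(np.mean(numpyarray))  # average price
print(np.max(numpyarray))   # highest price
```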

**Practice task:** <nbsp>
Print your output with the following adjustments:
> Let numpyarray = `np.array(Price.index.values)`

Compare and comment on the differences between the example and new output
    
Show the new code and include the comments below

In [None]:
# Practice Task here

## <ins> Value Voyage: Sorting and Organizing Data <ins>
Sorts a pandas Series by its values. Essential for organizing data in ascending or descending order to analyze trends or outliers.


### <ins> How to sort a series <ins>
* Example below:

In [None]:
import pandas as pd
import numpy as np
Price = pd.Series([5, 8, 2,9,14])
Price.sort_values()

2     2
0     5
1     8
3     9
4    14
dtype: int64
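
To sort in descending order instead, `sort_values()` accepts `ascending=False`. A short sketch on the same data, which also shows that the original Series is left unchanged:

```python
import pandas as pd

Price = pd.Series([5, 8, 2, 9, 14])

# ascending=False sorts from largest to smallest
descending = Price.sort_values(ascending=False)
print(descending)

# sort_values() returns a new Series; the original keeps its order
print(Price.tolist())
```

Note that each value keeps its original index label after sorting, which is why the index appears shuffled in the output.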

**Practice task:** <nbsp>
Print your output with the following adjustments:
> Set `Price` to a series with [9,8,1,2,7] and track the index values

Compare and comment on the differences between the example and new output
    
Show the new code and include the comments below

In [None]:
# Practice Task here

## <ins> Chaining Challenge: Maximizing Efficiency <ins>
Combining multiple attributes and methods in a single line of code. Enhances code readability and efficiency.


### <ins> How to perform attribute and method chaining <ins>
* Example below:
 * Attribute chaining is demonstrated by accessing the `breed` attribute and chaining the string methods `lower()` and `replace()` onto it to modify the breed of the dog. <br>
 * Method chaining is demonstrated by chaining the `bark()` method call with `upper()` and `replace()` to modify the output of the bark method.

In [None]:
class Dog:
    def __init__(self, breed, age):
        self.breed = breed
        self.age = age

    def bark(self):
        return "Woof!"

    def age_in_dog_years(self):
        return self.age * 7

# Creating an instance of the Dog class
my_dog = Dog("Labrador", 5)

# Attribute chaining example (modified)
breed_attribute_modified = my_dog.breed.lower().replace('a', 'o')  # Modified attribute chaining
print("Modified breed using attribute chaining:", breed_attribute_modified)

# Method chaining example (enhanced)
bark_method_chain_enhanced = my_dog.bark().upper().replace('O', 'X')  # Enhanced method chaining
print("Modified bark method output using enhanced method chaining:", bark_method_chain_enhanced)

# Additional task: Calculate dog's age in dog years
dog_years_age = my_dog.age_in_dog_years()
print("Dog's age in dog years:", dog_years_age)

Modified breed using attribute chaining: lobrodor
Modified bark method output using enhanced method chaining: WXXF!
Dog's age in dog years: 35
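
The same idea applies to pandas objects. As a sketch (the prices are reused from the sorting example, and the intermediate variable names are made up), the three separate steps below can be written as a single chained line:

```python
import pandas as pd

Price = pd.Series([5, 8, 2, 9, 14])

# Step by step, storing each intermediate result
sorted_prices = Price.sort_values()
two_smallest = sorted_prices.head(2)
total_stepwise = two_smallest.sum()

# The same steps chained in one line
total_chained = Price.sort_values().head(2).sum()

print(total_stepwise, total_chained)
```

Chaining works because each method returns a new pandas object that the next method can be called on.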


**Practice task:** <nbsp>
Print your output with the following adjustments:
> Change the dog's age to be 5 times the age of a person, <br>**Hint**:<br>
def age_in_dog_years(self): <br>
                  return self.age * 5 <br>

Compare and comment on the differences between the example and new output
    
Show the new code and include the comments below

In [None]:
# Practice Task here

## <ins> Index Intelligence: Sorting Strategies <ins>
Sorts a pandas Series or DataFrame by its index. Useful for returning data to its original order or organizing it according to the index.


### <ins> How to sort a data series by index <ins>
* Example below:
 * We create a Series, `series`, with the data and index provided as a dictionary.
 * The index in the original Series is unsorted ('b', 'a', 'c').
 * We use the `.sort_index()` method to sort the Series by index.
 * The sorted Series, `sorted_series`, now has its index in alphabetical order ('a', 'b', 'c').

In [None]:
import pandas as pd

# Create a Series with unsorted index
data = {'b': 2, 'a': 1, 'c': 3}
series = pd.Series(data)

# Print the unsorted Series
print("Unsorted Series:")
print(series)

# Sort the Series by index
sorted_series = series.sort_index()

# Print the sorted Series
print("\nSorted Series:")
print(sorted_series)

Unsorted Series:
b    2
a    1
c    3
dtype: int64

Sorted Series:
a    1
b    2
c    3
dtype: int64


**Practice task:** <nbsp>
Print your output with the following adjustments:
> Change the data to map 5 types of fruit to the number of fruits sold, **Hint**: 'Apple': 34, 'Orange': 60 <br>

Compare and comment on the differences between the example and new output
    
Show the new code and include the comments below

In [None]:
# Practice Task here

# **Pandas DataFrames** <br>
* Introduce the importance of understanding pandas DataFrames and the significance of efficient data manipulation in data analysis and data science projects. <br>


## <ins> Data Detective: Exploring Common Attributes <ins>
Discuss common attributes such as `shape`, `columns`, `index`, and `dtypes`, together with the `info()` method, for exploring DataFrame properties. Understanding these helps in gaining insights into the structure and characteristics of the DataFrame.


### <ins> How to display various dataframe attributes <ins>
* Example below:

In [None]:
import pandas as pd

# Create a DataFrame
data = {'A': [1, 2, 3], 'B': [4, 5, 6]}
df = pd.DataFrame(data)

# Display DataFrame attributes
print("Shape:", df.shape)
print("Columns:", df.columns)
print("Index:", df.index)
print("Data Types:", df.dtypes)

Shape: (3, 2)
Columns: Index(['A', 'B'], dtype='object')
Index: RangeIndex(start=0, stop=3, step=1)
Data Types: A    int64
B    int64
dtype: object
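
Unlike the attributes above, `info()` is a method, so it needs parentheses. The sketch below contrasts the two on the same DataFrame; the exact layout of the printed summary depends on the pandas version.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# Attributes are read without parentheses
n_rows, n_cols = df.shape

# info() is a method: calling it prints a summary of columns, dtypes and memory use
df.info()

print("Rows:", n_rows, "Columns:", n_cols)
```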


**Practice task:** <nbsp>
Print your output with the following adjustments:
> Set the `data` to three variables other than A and B, and let each variable contain 4 values, **Hint**: 'M': [1,2,3,4] <br>

Compare and comment on the differences between the example and new output
    
Show the new code and include the comments below

In [None]:
# Practice Task here

## <ins> Data Selection: Navigating the DataFrame Jungle <ins>
Explore various methods for selecting data from DataFrames. Efficient data selection is crucial for extracting relevant information from large datasets.


### <ins> How to select and display specific columns from a dataframe <ins>
* Example below:

In [None]:
# Create a DataFrame
data = {'A': [7, 8, 9], 'B': [12, 13, 14]}
df = pd.DataFrame(data)

# Select a single column
column_A = df['A']

# Select multiple columns
columns_AB = df[['A', 'B']]
print(column_A)
print(columns_AB)

0    7
1    8
2    9
Name: A, dtype: int64
   A   B
0  7  12
1  8  13
2  9  14


**Practice task:** <nbsp>
Print your output with the following adjustments:
> Add an additional variable `C` to the dataframe with 3 values of your choosing <br>
> Let column_BC = `df[['B','C']]` and print the output <br>

Compare and comment on the differences between the example and new output
    
Show the new code and include the comments below

In [None]:
# Practice Task here

## <ins> Index Insight: Unlocking Data with `.iloc[]` <ins>
Introduce the `.iloc[]` indexer for integer-based (positional) indexing of rows and columns. `.iloc[]` provides a flexible way to access specific rows and columns in a DataFrame.


### <ins> How to use `iloc` indexing to selectively categorize data <ins>
* Example below:

In [None]:
# Create a DataFrame
data = {'Octopus': [5, 7, 9], 'Squid': [12, 24, 36]}

df = pd.DataFrame(data)
df = df.rename(index={0:'cost in 2022',1:'cost in 2023',2:'cost in 2024'})

# Select a single element using integer indexing
element = df.iloc[0, 1]

# Select a row using integer indexing
row = df.iloc[1]

# Select multiple rows and columns using integer indexing
# (the column slice 1:3 runs past the last column, so only 'Squid' is returned)
subset = df.iloc[0:2, 1:3]

print(element)
print(row)
print(subset)

12
Octopus     7
Squid      24
Name: cost in 2023, dtype: int64
              Squid
cost in 2022     12
cost in 2023     24


**Practice task:** <nbsp>
Print your output with the following adjustments:
> Add an additional sea creature to the dataframe and give it 3 values<br>

Compare and comment on the differences between the example and new output
    
Show the new code and include the comments below

In [None]:
# Practice Task here

## <ins> Location, Location, Location: Leveraging `.loc[]` <ins>
Introduce the `.loc[]` indexer for label-based indexing of rows and columns. `.loc[]` allows for selecting data based on row and column labels, providing a more intuitive approach in many cases.

### <ins> How to use `.loc[]` indexing to selectively categorize data <ins>
* Example below:

In [None]:
import pandas as pd

# Create a DataFrame
data = {'Car': [5, 7, 9], 'Motorcycle': [12, 24, 36]}

# Create DataFrame and rename index
df = pd.DataFrame(data)
df = df.rename(index={0:'speed', 1:'mileage', 2:'fuel_capacity'})

# Select a single element using label indexing
element = df.loc['speed', 'Car']

# Select a row using label indexing
row = df.loc['mileage']

# Select multiple rows and columns using label indexing
subset = df.loc[['speed', 'mileage'], ['Car', 'Motorcycle']]

print(element)
print(row)
print(subset)

5
Car            7
Motorcycle    24
Name: mileage, dtype: int64
         Car  Motorcycle
speed      5          12
mileage    7          24


**Practice task:** <nbsp>
Print your output with the following adjustments:
> Set `element` with 'Motorcycle' and 'mileage' <br>
> Set `row` with 'fuel_capacity' <br>
> Set `subset` with ['fuel_capacity','speed'] and ['Car'] only <br>

Compare and comment on the differences between the example and new output
    
Show the new code and include the comments below

In [None]:
# Practice Task here

## <ins> Locating Insights: Mastering `.loc[]` and `.iloc[]` <ins>
* Provide insights and best practices on when to use .loc[] and .iloc[] for data selection.

* Understanding the differences between these two methods ensures correct and efficient data access.

### <ins> Summary <ins>
* `.iloc[]` uses integer-based indexing, while `.loc[]` uses label-based indexing.
* `.iloc[]` is exclusive of the end index, while `.loc[]` is inclusive.
* Use `.iloc[]` for positional indexing and `.loc[]` for label-based indexing.
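
The inclusive/exclusive difference in the summary above can be checked with a slice. This is a minimal sketch that reuses the Series from the earlier indexing example:

```python
import pandas as pd

series = pd.Series([1, 3, 5, 7, 9], index=['a', 'b', 'c', 'd', 'e'])

# Position-based slice: stops before position 3
by_position = series.iloc[0:3]

# Label-based slice: includes the end label 'd'
by_label = series.loc['a':'d']

print(by_position.tolist())
print(by_label.tolist())
```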

Also look into the course material for further context: https://learn.365datascience.com/courses/data-cleaning-preprocessing-pandas/a-few-comments-on-using-loc-and-iloc/

# **Pandas and NumPy** <br>
* NumPy and pandas are two essential libraries in the Python ecosystem for data manipulation and analysis. While NumPy provides the foundation for numerical computing with its powerful array data structure, pandas builds upon NumPy to offer high-level data structures and functions designed specifically for data analysis tasks.


## <ins> NumPy Essentials: Understanding ndarrays <ins>
NumPy arrays, also known as ndarrays, are the backbone of numerical computing in Python. They provide a powerful data structure for efficiently storing and manipulating large, multi-dimensional arrays of homogeneous data. Understanding NumPy arrays is crucial for performing advanced numerical computations and data analysis tasks in Python.

### <ins> How to create a NumPy array (numpy_array) containing numerical data <ins>
* Example below:

In [None]:
import numpy as np

# Create a NumPy array
numpy_array = np.array([[1, 2, 3], [4, 5, 6]])

# Display the NumPy array
print("NumPy Array:")
print(numpy_array)

NumPy Array:
[[1 2 3]
 [4 5 6]]
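
A few attributes describe the structure of an ndarray, and arithmetic is applied element by element. The sketch below uses the same array; the `doubled` variable is only for illustration.

```python
import numpy as np

numpy_array = np.array([[1, 2, 3], [4, 5, 6]])

print(numpy_array.ndim)   # number of dimensions
print(numpy_array.shape)  # (rows, columns)
print(numpy_array.dtype)  # one data type shared by every element

# Operations apply element by element, without an explicit loop
doubled = numpy_array * 2
print(doubled)
```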


**Practice task:** <nbsp>
Print your output with the following adjustments:
> Add an extra array to the `numpy_array` **Hint**: ['Octopod','Sea','Ocean'] <br>

Compare and comment on the differences between the example and new output
    
Show the new code and include the comments below

In [None]:
# Practice Task here

## <ins> From NumPy to Pandas: Creating Data Structures <ins>
Pandas DataFrames are built on top of NumPy arrays, allowing for easy integration between the two libraries. Creating pandas DataFrames from NumPy arrays provides a convenient way to analyze and manipulate structured data. Converting NumPy arrays to pandas DataFrames enables seamless data analysis using pandas' high-level functions and methods.


### <ins> How to convert NumPy array to a pandas DataFrame <ins>
* Example below:

In [None]:
import numpy as np
import pandas as pd
# Create a NumPy array
numpy_array = np.array([[21, 25, 36], [103, 22, 84]])

# Convert NumPy array to DataFrame
df = pd.DataFrame(numpy_array, columns=['Kingfish', 'Barracuda', 'Sailfish'], index =[2020,2021])

# Display the DataFrame
print("\nPandas DataFrame:")
print(df)


Pandas DataFrame:
      Kingfish  Barracuda  Sailfish
2020        21         25        36
2021       103         22        84


**Practice task:** <nbsp>
Print your output with the following adjustments:
> Add an additional value in each set for the `numpy_array`, and a fourth name to `columns` to match, **Hint**: [21, 25, 36, **41**],[103, 22, 84, **15**] <br>
> Add 2022 to the `index`, together with a third set of values so that the number of rows matches, **Hint**: [2020,2021,**2022**]<br>

Compare and comment on the differences between the example and new output
    
Show the new code and include the comments below

In [None]:
# Practice Task here

## <ins>  Pandas to NumPy: Extracting Data with Ease <ins>
While pandas offers powerful data manipulation capabilities, there are situations where working directly with NumPy arrays is necessary. Extracting NumPy arrays from pandas structures provides flexibility and interoperability with other libraries, facilitating advanced numerical computations and data analysis workflows.


### <ins> How to extract a NumPy array from a DataFrame column <ins>
* Example below:

In [None]:
# Extract a NumPy array from a DataFrame column

df = pd.DataFrame({'House_color': ['Yellow_house','Green_house', 'Red_house'],'Points':[45,50,33]})

numpy_array_house = df['House_color'].to_numpy()

# Display the extracted NumPy array
print("\nExtracted NumPy Array from DataFrame:")
print(numpy_array_house)


Extracted NumPy Array from DataFrame:
['Yellow_house' 'Green_house' 'Red_house']


**Practice task:** <nbsp>
Print your output with the following adjustments:
> Extract the points instead of the house color, **Hint**: `df['Points']` <br>
> Don't forget to change the name of the NumPy array to reference the new data extracted, **Hint**: `numpy_array_points`

Compare and comment on the differences between the example and new output
    
Show the new code and include the comments below

In [None]:
# Practice Task here

## <ins>  NumPy Math: Perform Calculations Efficiently <ins>
pandas leverages the efficiency and versatility of NumPy for performing numerical computations on data. Many pandas functions and methods internally use NumPy arrays for efficient computation of statistics and other operations. Leveraging NumPy for calculations in pandas ensures efficient processing of large datasets and enables the use of advanced numerical algorithms for data analysis tasks.

### <ins> How to calculate the mean of a DataFrame using NumPy's mean function <ins>
* Example below:

In [None]:
import pandas as pd
import numpy as np

# Sample DataFrame for athletic games
data = {
    'Athlete': ['Richard', 'James', 'Anabel', 'Lily'],
    '100m Sprint': [10.5, 11.2, 10.8, 10.9],
    'Long Jump': [7.5, 6.8, 7.2, 7.0],
    'High Jump': [2.0, 1.9, 2.1, 2.2]
}

df = pd.DataFrame(data)

# Calculate mean using NumPy across specific columns (axis=1 for row-wise operation)
df['Mean'] = np.mean(df[['100m Sprint', 'Long Jump', 'High Jump']], axis=1)

# Display the DataFrame with the mean column
print("\nDataFrame with Mean Column:")
print(df)




DataFrame with Mean Column:
   Athlete  100m Sprint  Long Jump  High Jump      Mean
0  Richard         10.5        7.5        2.0  6.666667
1    James         11.2        6.8        1.9  6.633333
2   Anabel         10.8        7.2        2.1  6.700000
3     Lily         10.9        7.0        2.2  6.700000
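
For comparison, pandas has its own `mean()` method that gives the same row-wise result. The sketch below uses only the first two athletes' numbers from the example and checks that both approaches agree:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    '100m Sprint': [10.5, 11.2],
    'Long Jump': [7.5, 6.8],
    'High Jump': [2.0, 1.9],
})

numpy_mean = np.mean(df.to_numpy(), axis=1)  # NumPy on the raw array
pandas_mean = df.mean(axis=1).to_numpy()     # pandas' own method

print(numpy_mean)
print(np.allclose(numpy_mean, pandas_mean))
```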


**Practice task:** <nbsp>
Print your output with the following adjustments:
> Give James a Long Jump distance of 8 and monitor the new average <br>


Compare and comment on the differences between the example and new output
    
Show the new code and include the comments below

In [None]:
# Practice Task here

## <ins>  NumPy Logic: Mastering Conditional Operations <ins>
NumPy arrays can be seamlessly integrated into pandas for performing conditional logic and filtering operations. Using NumPy arrays in conditional statements within pandas allows for flexible and efficient data manipulation.

### <ins> How to filter a DataFrame using a NumPy boolean array <ins>
* Example below:

In [None]:
import pandas as pd

# Sample DataFrame for athletic games
data = {
    'Athlete': ['Richard', 'James', 'Anabel', 'Lily'],
    '100m Sprint': [15, 7, 20, 7],
    'Long Jump': [9, 8, 6, 5],
    'High Jump': [1.5, 1.6, 1.5, 2.0]
}
df = pd.DataFrame(data)

# Filter DataFrame based on condition using NumPy array
new_df = df[df['100m Sprint'].to_numpy() < 11]

# Display the filtered DataFrame
print("\nFiltered DataFrame:")
print(new_df)



Filtered DataFrame:
  Athlete  100m Sprint  Long Jump  High Jump
1   James            7          8        1.6
3    Lily            7          5        2.0
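
The comparison inside the brackets produces a boolean NumPy array, and that array can also drive `np.where` to label rows instead of filtering them. This is a sketch only; the `Group` column and its labels are invented for the illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Athlete': ['Richard', 'James', 'Anabel', 'Lily'],
    '100m Sprint': [15, 7, 20, 7],
})

# The comparison produces a boolean NumPy array, one entry per row
mask = df['100m Sprint'].to_numpy() < 11
print(mask)

# np.where picks the first label where mask is True, the second otherwise
df['Group'] = np.where(mask, 'under 11', '11 or more')
print(df)
```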


**Practice task:** <nbsp>
Print your output with the following adjustments:
> Change the condition to 'Long Jump' and < 7 <br>


Compare and comment on the differences between the example and new output
    
Show the new code and include the comments below

In [None]:
# Practice Task here

# **Final Discord submission Challenge**

This challenge tests your understanding of different visualizations and your ability to derive findings or trends in data and suggest recommendations.

**Instructions:**

- Calculations: Carefully perform the following calculations.  Do not use a calculator for the calculations themselves – focus on understanding the Python code and its logic.

- Discord Submission: Explained below

In [None]:
import pandas as pd
import numpy as np

# Download the CSV file from the link below, then upload it by running this code
from google.colab import files
uploaded = files.upload()

# Load the dataset from the CSV file
df = pd.read_csv('train.csv')

# Display the first few rows of the DataFrame to verify successful loading
print(df.head())

Saving train.csv to train (2).csv
   Row ID        Order ID  Order Date   Ship Date       Ship Mode Customer ID  \
0       1  CA-2017-152156  08/11/2017  11/11/2017    Second Class    CG-12520   
1       2  CA-2017-152156  08/11/2017  11/11/2017    Second Class    CG-12520   
2       3  CA-2017-138688  12/06/2017  16/06/2017    Second Class    DV-13045   
3       4  US-2016-108966  11/10/2016  18/10/2016  Standard Class    SO-20335   
4       5  US-2016-108966  11/10/2016  18/10/2016  Standard Class    SO-20335   

     Customer Name    Segment        Country             City       State  \
0      Claire Gute   Consumer  United States        Henderson    Kentucky   
1      Claire Gute   Consumer  United States        Henderson    Kentucky   
2  Darrin Van Huff  Corporate  United States      Los Angeles  California   
3   Sean O'Donnell   Consumer  United States  Fort Lauderdale     Florida   
4   Sean O'Donnell   Consumer  United States  Fort Lauderdale     Florida   

   Postal Code R

[Click here to download the Train data](https://drive.google.com/file/d/1cfEp7D-3ROixvgRK5RGp45GjVTCQd8bS/view?usp=sharing)

### Challenges

- **Challenge 1:** Create a pandas Series with the Sales data <br>
                From the series, which year had the highest sales and what is the sales figure?
                
- **Challenge 2:** Set the index of the DataFrame to the Order ID <br>
                Using Order ID as index, what is the city name at the 10th row position?
                
- **Challenge 3:** Add a new column 'Discount' with values linearly spaced between 0 and 0.5 <br>
                Which order ID had the best discount?

- **Challenge 4:** Filter the DataFrame to include only orders from California <br>
                What is the date of the most recent order coming from California? Give the date in dd-mm-yyyy format

- **Challenge 5:** Create a NumPy array with the Postal Code data <br>
                Which postal code has the best total sales?

- **Challenge 6:** Extract a NumPy array from the 'Sales' column and increase sales by 10% <br>
                What is the year with the lowest increased sales?

- **Challenge 7:** Calculate the mean sales for each Category using NumPy and add as a new column <br>
                Using the head function, what is the category with the highest mean sales?

In [None]:
# Print the results for each challenge
# Example Output below:

**Discord Submission:** Find the corresponding Discord answer. It will have a dropdown with the following answer format:

Based on the example output above, if your calculated results are:
> Challenge 1: 2018, 22638.48<br>
> Challenge 2: Los Angeles <br>
> Challenge 3: CA-2016-128608<br>
> Challenge 4: 31-12-2017<br>
> Challenge 5: 32216.0 <br>
> Challenge 6: 2018 <br>
> Challenge 7: Furniture <br>

Your dropdown selection on Discord should be:

- 2018, 22638.48, Los Angeles, CA-2016-128608, 31-12-2017, 32216.0, 2018, Furniture

<img style="display: block; margin: 0 auto" src="https://favim.com/pd/s5/orig/151229/adorable-believe-can-cute-Favim.com-3820965.jpg" alt="Lonely Octopus end">

## <ins> Additional Notes on Using DataFrames <ins>
Provides further insights and tips on maximizing the use of DataFrames for data analysis.

### <ins>Importance<ins>
Enhances the understanding and efficient use of DataFrames in practical scenarios.


### <ins> References <ins>
**Pandas User Guide**: https://pandas.pydata.org/docs/user_guide/index.html#user-guide <br>
