# Getting Started with Basic Python Libraries

### Purpose

Provide a brief introduction to Python's core data libraries (NumPy, pandas, Matplotlib, and Seaborn), highlighting their key features and how they are used in Python applications.

---

### Content

1. NumPy Library
2. Pandas Library
3. Matplotlib Library
4. Seaborn Library

## NumPy Library

**NumPy**, short for *Numerical Python*, is a powerful, foundational library for scientific computing and numerical analysis in Python. It provides efficient data structures alongside highly optimized mathematical functions designed to operate on these data structures.

It was designed to handle large volumes of data efficiently, and it serves as a foundation for many advanced libraries such as **Pandas**, **scikit-learn**, and others.

---

## Import NumPy Library




In [None]:
import numpy as np

## NumPy Features



1. **Linear Algebra**

    *NumPy's Linear Algebra tools* make it easy to work with matrices, supporting operations like multiplication, determinant calculation, and matrix inversion. You can also solve equations and understand key matrix properties like eigenvalues.
   
2. **Statistics**

    *NumPy's Statistics tools* simplify data analysis by providing functions to calculate common metrics, including mean, median, variance, and standard deviation. These functions help developers summarize and understand datasets quickly, making NumPy highly useful for data analysis and learning statistics.

3. **Data Manipulation**

    *NumPy's Data Manipulation tools* allow developers to effortlessly reshape arrays, combine or split data, and access specific parts of arrays using indexing and slicing. These features make organizing and adjusting large datasets straightforward and efficient.

4. **N-dimensional Arrays** 

    *NumPy's N-dimensional Arrays* are versatile tools for storing and working with data in multiple dimensions, such as lists or tables. They are fast, easy to use, and optimized for mathematical operations, making them ideal for exploring structured data and performing calculations.

5. **Mathematical Operations**

   *NumPy's Mathematical Operations* enable developers to carry out calculations like addition, multiplication, and more directly on arrays without needing loops. This approach is fast, efficient, and simplifies handling large datasets in Python.
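
The features listed above can be sketched with a few small arrays. This is a minimal illustration, not part of the original notebook: the matrix `A`, the vector `b`, the sample `data`, and the `grid` array are made-up values chosen only to keep the arithmetic easy to follow.

```python
import numpy as np

# Linear algebra: a small 2x2 system A @ x = b (illustrative values)
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])

x = np.linalg.solve(A, b)           # solve the linear system
det_A = np.linalg.det(A)            # determinant
inv_A = np.linalg.inv(A)            # matrix inverse
eigenvalues = np.linalg.eigvals(A)  # eigenvalues
print("Solution:", x, "- Determinant:", det_A)

# Statistics: common summary metrics on a 1-D array
data = np.array([2, 4, 4, 4, 5, 5, 7, 9])
print("Mean:", np.mean(data), "- Median:", np.median(data))
print("Variance:", np.var(data), "- Std:", np.std(data))

# Data manipulation: reshape, slice, and split an N-dimensional array
grid = np.arange(12).reshape(3, 4)  # 3 rows x 4 columns
first_row = grid[0]                 # indexing a row
last_column = grid[:, -1]           # slicing a column
top, bottom = np.split(grid, [1])   # split after the first row

# Mathematical operations: element-wise arithmetic without a loop
doubled = grid * 2
```

Note that `np.var` and `np.std` compute the population variance and standard deviation by default; pass `ddof=1` for the sample versions.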

---

## Exploring NumPy Features

In [None]:
# Creating a list using native modules of the Python Standard Library
import random

salary_list = [random.randint(5000, 10000) for _ in range(1_000)] # Create a list of 1,000 salaries ranging from 5,000 to 10,000

In [None]:
# Comparing the execution time for calculating the average salaries using the statistics module and the mean function from the NumPy library
import statistics
from timeit import timeit

time_statistics = timeit(lambda: statistics.mean(salary_list), number=1000)
print(f"Average Salary (statistics): {statistics.mean(salary_list)} - Execution Time: {time_statistics:.2f}s")

time_mean = timeit(lambda: np.mean(salary_list), number=1000)
print(f"Average Salary (NumPy): {np.mean(salary_list)} - Execution Time: {time_mean:.2f}s")
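
One caveat about the comparison above: `np.mean` converts a Python list into an array on every call, so the measured time includes that conversion. The sketch below converts the list once and times both variants; the name `salary_array` and the fixed seed are assumptions added for illustration. Actual timings depend on the machine, so no expected numbers are shown.

```python
import random
import statistics
from timeit import timeit

import numpy as np

random.seed(0)  # fixed seed so the sketch is reproducible
salary_list = [random.randint(5000, 10000) for _ in range(1_000)]
salary_array = np.array(salary_list)  # hypothetical name: convert once, outside the timed call

# Time np.mean on the list (converted on each call) and on the prebuilt array
time_list = timeit(lambda: np.mean(salary_list), number=1000)
time_array = timeit(lambda: np.mean(salary_array), number=1000)

print(f"np.mean on a list:   {time_list:.4f}s")
print(f"np.mean on an array: {time_array:.4f}s")
```

Working on an array from the start is usually the fairer way to measure NumPy, since real workloads rarely rebuild the array on every operation.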

In [None]:
# One more example of numpy features
job_titles = np.array(['Data Analyst', 'Data Scientist', 'Data Engineer', 'Machine Learning Engineer', 'AI Engineer']) # Creating an array with Job Titles
base_salaries = np.array([60000, 80000, 75000, 90000, 45000]) # Creating an array with Base Salaries
bonus_rates = np.array([.05, .1, .08, .12, 0]) # Creating an array with Bonus Rates

total_salaries = base_salaries * (1 + bonus_rates) # Calculating the salaries according to the bonus rates
print("Total salaries:", total_salaries)

average_salary = np.mean(total_salaries) # Calculating the average salary
print(f"Average salary: ${average_salary:.2f}")
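
As a possible follow-up to the example above, boolean masks and `np.argmax` can answer questions such as "which roles earn above the average?" without a loop. The arrays are repeated so the sketch runs on its own; the names `above_average` and `top_index` are introduced here for illustration.

```python
import numpy as np

job_titles = np.array(['Data Analyst', 'Data Scientist', 'Data Engineer',
                       'Machine Learning Engineer', 'AI Engineer'])
base_salaries = np.array([60000, 80000, 75000, 90000, 45000])
bonus_rates = np.array([.05, .1, .08, .12, 0])
total_salaries = base_salaries * (1 + bonus_rates)

# Boolean mask: True where the total salary exceeds the average
above_average = total_salaries > total_salaries.mean()
print("Above average:", job_titles[above_average])

# Position of the highest total salary
top_index = np.argmax(total_salaries)
print(f"Highest paid: {job_titles[top_index]} (${total_salaries[top_index]:.2f})")
```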

## Pandas Library

**Pandas**, whose name derives from *panel data* and is also a play on *Python data analysis*, is a powerful and flexible library for data manipulation and analysis. It provides high-performance, easy-to-use data structures like **Series** and **DataFrames**, enabling effective handling, organization, and analysis of structured datasets.

Pandas is widely used for tasks involving data cleaning, transformation, and visualization, making it essential for anyone working in data science or related fields.

---

## Import Pandas Library

In [None]:
import pandas as pd

## Pandas Features

1. **DataFrames**

    *Pandas DataFrames* are two-dimensional, tabular data structures (similar to Excel spreadsheets or SQL tables). DataFrames allow developers to store and manipulate labeled data efficiently.
   
2. **Series**

    *Pandas Series* are one-dimensional labeled data structures, akin to a single column in a DataFrame. They support array-like operations. Each Series has one data type, although the generic `object` type can hold mixed values.

3. **Data Manipulation**

    *Pandas Data Manipulation tools* simplify transforming and reshaping data. You can filter rows, slice columns, merge datasets, and handle missing values with ease, making Pandas ideal for cleaning and preparing datasets.

4. **Data Cleaning** 

    *Pandas Data Cleaning tools* are designed to handle incomplete or noisy datasets. With functions for filling missing values, dropping duplicates, and handling null values, Pandas helps to ensure data quality and consistency.

5. **Aggregation and Grouping**

   *Pandas Aggregation and Grouping* enable summarizing large datasets by grouping rows or columns based on specific criteria. You can calculate metrics like sum, mean, or count, making it easier to analyze trends and patterns.
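
The features above can be sketched on a tiny, made-up table. The DataFrame `titles` and its columns (`title`, `category`, `minutes`) are hypothetical and are not taken from the Netflix dataset used later.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
titles = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "category": ["Movie", "TV Show", "Movie", "Movie"],
    "minutes": [90, 45, 120, 105],
})

# Series: selecting one column returns a one-dimensional Series
minutes = titles["minutes"]
print(type(minutes).__name__, minutes.dtype)

# Data manipulation: keep only the rows longer than 100 minutes
long_titles = titles[titles["minutes"] > 100]
print(long_titles)

# Aggregation and grouping: count and average duration per category
summary = titles.groupby("category")["minutes"].agg(["count", "mean"])
print(summary)
```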

---

## Exploring Pandas Features

### Inspecting the DataFrame

In [None]:
netflix_data = pd.read_csv('../input/netflix-data/Netflix Dataset.csv') # Creating a DataFrame
netflix_data.head() # Showing the first five rows

In [None]:
netflix_data.info() # Showing each column's data type and non-null count

In [None]:
netflix_data.describe() # Showing summary statistics for the dataset

In [None]:
# Know more about the DataFrame
print(f"Number of Rows: {netflix_data.shape[0]}") # Rows number
print(f"Number of Columns: {netflix_data.shape[1]}") # Columns number
print(f"Total Size: {netflix_data.size}") # (Total of Cells) DataFrame Size
print(f"DataFrame Columns: {netflix_data.columns.tolist()}") # Showing the DataFrame Columns
print(f"DataFrame Shape: {netflix_data.shape}") # Showing the DataFrame Size (Rows x Columns)

In [None]:
# Selecting the Release_Date value at index label 7000
netflix_data.loc[7000, 'Release_Date']

In [None]:
# Showing unique values for Category Column
netflix_data.Category.unique()

### Cleaning the DataFrame

In [None]:
# Searching duplicated records in the DataFrame
netflix_data[netflix_data.duplicated()]

In [None]:
# Removing duplicated values
netflix_data.drop_duplicates(inplace = True)
print(f"DataFrame Shape after removing duplicated values: {netflix_data.shape}")

In [None]:
# Counting the NaN values in each column
netflix_data.isnull().sum()

In [None]:
# Replacing NaN values with 'Unknown'
netflix_data['Director'] = netflix_data['Director'].fillna('Unknown')
netflix_data['Cast'] = netflix_data['Cast'].fillna('Unknown')
netflix_data['Country'] = netflix_data['Country'].fillna('Unknown')
print(f"DataFrame Shape after Replacing NaN values: {netflix_data.shape}")

In [None]:
# Removing rows with NaN values in the Rating or Release_Date columns
netflix_data.dropna(subset=['Rating', 'Release_Date'], inplace=True)
print(f"DataFrame Shape after Removing NaN values: {netflix_data.shape}")
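
To see what each cleaning step does without loading the full dataset, the same sequence can be tried on a small frame. This is a sketch under stated assumptions: the `toy` DataFrame and its values are invented, and only the column names `Director` and `Rating` mirror the dataset above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one duplicated row, one missing Director, one missing Rating
toy = pd.DataFrame({
    "Title": ["A", "A", "B", "C"],
    "Director": ["X", "X", np.nan, "Y"],
    "Rating": ["PG", "PG", "R", np.nan],
})

toy = toy.drop_duplicates()                          # drops the repeated "A" row
toy["Director"] = toy["Director"].fillna("Unknown")  # fills the missing Director
toy = toy.dropna(subset=["Rating"])                  # drops the row without a Rating

print(toy)
print(toy.isnull().sum())  # every column should now report 0 missing values
```

The sketch reassigns the result of each step instead of using `inplace=True`; both styles give the same final frame here, and reassignment makes each intermediate result easier to inspect.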

## Matplotlib Library

**Matplotlib** is a powerful and flexible library for data visualization in Python, originally designed to offer MATLAB-style plotting. It provides a wide array of tools to create static, animated, and interactive visualizations, making it essential for exploring patterns and trends in datasets.

Matplotlib is widely used in data science, machine learning, and scientific computing to create high-quality plots, charts, and graphs. It serves as the foundation for other visualization tools, such as Seaborn and the plotting methods built into Pandas.

---

## Import Matplotlib Library

In [None]:
import matplotlib.pyplot as plt

## Matplotlib Features

1. **Line Plots**

    *Matplotlib's Line Plots* allow users to visualize relationships between two variables over a continuous range. They're ideal for showing trends, changes over time, or data patterns.
   
2. **Bar Charts**

    *Bar Charts* help visualize categorical data with rectangular bars representing values. They're ideal for comparing different groups or tracking frequency.

3. **Histograms**

    *Histograms* are used for visualizing the distribution of a dataset. They divide data into bins and display the frequency of values in each range.

4. **Pie Charts** 

    *Pie Charts* visually depict proportions within a dataset as slices of a circular pie. They're ideal for simple comparisons of categorical data.

5. **Scatter Plots**

   *Scatter Plots* represent relationships between two numeric variables using points. They're ideal for identifying correlations, clusters, and outliers.
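
Three of the plot types above (line, bar, and scatter) can be sketched side by side with synthetic data. All values below are generated for illustration and have no connection to the Netflix dataset.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data, for illustration only
x = np.linspace(0, 10, 50)
rng = np.random.default_rng(0)
noisy = x + rng.normal(0, 1, size=x.size)

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))

# Line plot: a continuous trend
axes[0].plot(x, np.sin(x))
axes[0].set_title("Line Plot")

# Bar chart: one bar per category
axes[1].bar(["A", "B", "C"], [5, 3, 8])
axes[1].set_title("Bar Chart")

# Scatter plot: the relationship between two numeric variables
axes[2].scatter(x, noisy)
axes[2].set_title("Scatter Plot")

fig.tight_layout()
plt.show()
```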

---

## Exploring Matplotlib Features

In [None]:
# Creating a Histogram to show the distribution of Category Column
plt.hist(netflix_data['Category'], bins=3)
plt.title("Category Histogram")
plt.show()

In [None]:
# Creating a Pie Chart to show the share of the 5 most frequent values in the Type Column
type_counts = netflix_data['Type'].value_counts().nlargest(5)
labels = type_counts.index # type description
values = type_counts.values
plt.pie(values, labels=labels, autopct='%1.1f%%')
plt.title("Type Pie Chart")
plt.show()

## Seaborn Library

**Seaborn**, built on top of *Matplotlib*, is a powerful Python library for data visualization. It simplifies the creation of aesthetically pleasing and informative plots by providing high-level interfaces for statistical graphics.

It was designed to make statistical visualization more intuitive, serving as an essential tool for exploring relationships in datasets and discovering patterns. Seaborn works seamlessly with Pandas DataFrames, making it a favorite in the fields of data science and machine learning.

---

## Import Seaborn Library

In [None]:
import seaborn as sns

## Seaborn Features

1. **Relationship Exploration**

    *Seaborn's Relationship Exploration tools* allow you to visualize changes or correlations between two variables using scatterplots (for raw points) and line plots (for trends). These tools make relationship analysis simple and effective.
   
2. **Distribution Visualization**

    *Seaborn's Distribution Visualization tools* display the shape, spread, and range of your data, helping you understand its overall structure. Common options include histograms, KDE plots, and boxplots.

3. **Categorical Data**

    *Seaborn's Categorical Data tools* enable analysis of datasets organized by categories. Bar plots, count plots, and violin plots can uncover differences across groups or highlight trends.

4. **Heatmaps** 

    *Seaborn's Heatmaps* provide a visual representation of data matrices, allowing users to explore relationships and trends across numerical values effectively.

5. **Pairwise Relationships**

   *Seaborn's Pairwise Relationships tools*, such as `pairplot`, draw a grid of plots covering every pair of numeric columns in a dataset. This gives a quick overview of how the variables relate to one another and how each one is distributed.

---

## Exploring Seaborn Features

In [None]:
# Create a Distribution of Movies vs TV Shows by Top 5 Countries
netflix_data['Country_Split'] = netflix_data['Country'].str.split(',').str[0].str.strip()
content_data = netflix_data.pivot_table(index='Country_Split', columns='Category', aggfunc='size', fill_value=0).reset_index() # Create pivot table with counts for Movies and TV Shows by Country
content_data['Total'] = content_data['Movie'] + content_data['TV Show'] # Add a Total column for sorting purposes
top_5_countries = content_data.nlargest(5, 'Total').drop(columns=['Total']) # Select the top 5 countries based on the total number of titles

# Melt the pivot table for Seaborn compatibility
content_data_melted = top_5_countries.melt(id_vars='Country_Split', var_name='Category', value_name='Count')

plt.figure(figsize=(12, 6))
sns.barplot(data=content_data_melted, x='Country_Split', y='Count', hue='Category', palette=['skyblue', 'salmon']) # Create the barplot

# Customize the plot
plt.title("Movies vs TV Shows Distribution by Top 5 Countries")
plt.xlabel("Country")
plt.ylabel("Number of Titles")
plt.xticks(rotation=45, ha="right")  # Rotate x-axis labels for better readability
plt.legend(title="Category")
plt.tight_layout()

# Show the plot
plt.show()

Content created based on *Python for Data Analytics - Full Course for Beginners* by **Luke Barousse**