<a href="https://colab.research.google.com/github/txusser/Master_IA_Sanidad/blob/main/Modulo_2/2_9_Librerias_Ciencia_Datos_Numpy_Pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Python Libraries for Data Science

### NumPy
NumPy's n-dimensional arrays are faster and more compact than Python lists. Because every element of an array shares one data type, NumPy stores the data in far less memory than a list of the same values. It also lets you specify that data type explicitly, which allows further optimization of the code.

* More information about NumPy in [NumPy fundamentals](https://numpy.org/doc/stable/user/basics.html)
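
As a rough illustration of the memory claim above, the sketch below compares the bytes used by a Python list of integers with NumPy arrays holding the same values under two data types. The variable names and the size of 1,000 elements are illustrative assumptions, and the list figure is only an approximation that depends on the Python build.

```python
import sys
import numpy as np

n = 1000  # illustrative size, not taken from the notebook

# A Python list stores references to separate int objects
py_list = list(range(n))
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(v) for v in py_list)

# A NumPy array stores the raw values with one fixed data type
arr64 = np.arange(n, dtype=np.int64)
arr16 = arr64.astype(np.int16)  # the values 0..999 fit in 16 bits

print("List (approx. bytes):", list_bytes)
print("int64 array bytes:", arr64.nbytes)
print("int16 array bytes:", arr16.nbytes)
```

Choosing the smallest data type that can hold the values is one way to reduce memory further, at the cost of a narrower value range.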


In [None]:
from rich import print
from rich.console import Console
console = Console()

# We add a title for the NumPy arrays section
console.rule("[cyan]Start numpy arrays[/cyan]")

# Import the NumPy library as 'np' (common convention)
import numpy as np


### Getting started and working with arrays:


In [None]:
# We create a Python list with numbers from 0 to 4
l = list(range(5))
print("List l:", l)

# We create a NumPy array with numbers from 0 to 4
a = np.arange(5)
print("Array a: ", a)

print("type(l):", type(l))  # Display the data type of 'l'
print("type(a):", type(a))  # Display the data type of 'a'

# We create a NumPy array with numbers from 4 to 19, with a step of 3 (the stop value 21 is excluded)
a = np.arange(4, 21, 3)
print(f"\n- a = {a}")

# We create a NumPy array with odd numbers from 1 to 19
a = np.arange(1, 20, 2)
print(f"\n- a = {a}")

# Display the data type of the array 'a'
print("\n- Data type of a:", type(a))


In [None]:
# We create a new Python list
l = [1, 2, 4, 6]
print("List l=", l)
print("type(l):", type(l))

# Convert the Python list into a NumPy array
alist = np.array(l)
print("\n- alist:", alist)
print(" = Data type of alist:", type(alist))

# Show the dimensionality (shape) of the NumPy array
print("Dimensions:", alist.shape)


In [None]:
console.rule("[cyan]Creation of arrays[/cyan]")

# One-dimensional array
print("\n=== One-dimensional array ===")
x = np.linspace(0, 9, 21)
print(f"x = {x}\n")

# One-dimensional array with specific elements
print("=== One-dimensional array with specific elements ===")
y = np.array([1, 3, 5, 7, 9])
print(f"y = {y}\n")

# One-element array with a random value in [0, 1); each call draws a new number
print("=== Array with random elements ===")
z = np.random.rand(1)
print(f"z = \n{z}\n")
z = np.random.rand(1)
print(f"z = \n{z}\n")

# Generate random numbers and calculate mean and standard deviation
a = np.random.rand(10000)
print("-a shape:", a.shape)
print("mean:", np.mean(a))
print("std:", np.std(a))


In [None]:
# Multidimensional array with random elements
print("=== Multidimensional array with random elements ===")
z = np.random.rand(4, 3)  # Creates an array with 4 rows and 3 columns, with random numbers between 0 and 1
print(f"z = \n{z}")
print(f"Shape of z: {z.shape}")  # Displays the dimensions of the array
print(f"Data type of z: {z.dtype}\n")  # Displays the data type of the array
# Round to the nearest integer and convert the result to 8-bit integers
z = np.rint(z).astype(np.int8)
print("Int8:", z)
print(f"Data type of z: {z.dtype}\n")  # Displays the data type of the array

# Array of zeros
print("=== Array of zeros ===")
zeros = np.zeros((2, 5))
print(f"zeros = \n{zeros}\n")

# Array of ones
print("=== Array of ones ===")
ones = np.ones((3, 3))
print(f"ones = \n{ones}\n")


### Manipulation of multidimensional arrays in Python
* Creation of 2D arrays, type conversion (from float to int), rounding, and operations with arrays such as identifying unique values.
* Working with arrays of mixed elements and generating large random arrays.

The utility of these operations lies in their efficiency for handling large datasets, which is essential in statistical analysis, image processing, and machine learning.
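
The operations listed above can be tried on a small array first. The following is a minimal sketch with made-up values (the names `m`, `rounded`, and `as_int` are illustrative); it shows that `np.rint` sends exact halves to the nearest even integer, which can be surprising.

```python
import numpy as np

# Small 2D float array (values chosen for illustration)
m = np.array([[0.5, 1.5, 2.5],
              [1.2, 2.7, 2.5]])

# np.rint rounds to the nearest integer; exact halves go to the nearest even value
rounded = np.rint(m)

# astype returns a copy with the requested data type
as_int = rounded.astype(np.int8)

# np.unique flattens the array and returns the sorted unique values and their counts
values, counts = np.unique(as_int, return_counts=True)

print(as_int)
print("dtype:", as_int.dtype)
print("values:", values, "counts:", counts)
```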


In [None]:
console.rule("[cyan]Higher Dimensional Arrays[/cyan]")

# Higher-dimensional array
print("\n=== Higher-dimensional array ===")
y = np.array([[1.48889, 2.89, 3.777], [4.324353463, 5, 6], [1, 2, 5], [2, 2, 2]], np.float16)
print(f"y = {y}\nShape: {y.shape}")
y = np.rint(y).astype(np.int8) # The astype method allows us to modify the data type of the array
print(f"\nConversion to closest integers => y: \n{y}")

# Two-dimensional array mixing integers and floats (NumPy stores all of them as floats)
print("\n=== Two-dimensional array with mixed elements ===")
a = np.array([[1, 2, 2, 4, 3, 4], [1, 22.345245, 2, 42, 3, 4]])
print("Array a:", a)

# Get unique values
z, c = np.unique(a, return_counts=True)
print(f"= Unique values in the array: {z}")
print("How many times each value repeats:", c)

# Count unique values in a large random array (almost all values are expected to be distinct)
z = np.random.rand(5040, 40)
print("Unique values in z:", len(np.unique(z)))
print("Size: 5040*40 =", 5040*40)


### Basic Descriptive Statistics

NumPy provides basic statistical functions that can be applied directly to arrays and that are essential for exploratory data analysis in data science: the maximum, minimum, mean, standard deviation, and percentiles of an array.
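
These functions operate on the whole array by default, but most of them also accept an `axis` argument. The sketch below, using a small made-up array named `obs`, shows the difference between a global statistic and per-column or per-row statistics.

```python
import numpy as np

# Illustrative 3x2 array: rows are observations, columns are variables
obs = np.array([[1.0, 10.0],
                [2.0, 20.0],
                [3.0, 30.0]])

overall_mean = np.mean(obs)                    # over all elements
col_mean = np.mean(obs, axis=0)                # one value per column
row_max = np.max(obs, axis=1)                  # one value per row
col_median = np.percentile(obs, 50, axis=0)    # 50th percentile per column

print("Overall mean:", overall_mean)
print("Mean per column:", col_mean)
print("Max per row:", row_max)
print("Median per column:", col_median)
```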


In [None]:
console.rule("[cyan]Various Methods to Execute on Arrays[/cyan]")

# Reuse the random array z created in the previous cell
y = z
print("Y:", y)

print("\nMaximum value:")
print("Max:", np.max(y))

print("\nMinimum value:")
print("Min:", np.min(y))

print("\nMean:")
print("Mean:", np.mean(y))

print("\nStandard deviation:")
print("Std:", np.std(y))

print("\n25th and 99th percentiles:")
print("Pct:", np.percentile(y, q=25))
print("Pct Q99:", np.percentile(y, q=99))


In [None]:
# Complement of descriptive statistics

print("\nMedian:")
print("Median:", np.median(y))

print("\nVariance:")
print("Var:", np.var(y))

print("\nSum of all elements:")
print("Sum:", np.sum(y))

# With this many values in [0, 1), the product underflows to 0.0
print("\nProduct of all elements:")
print("Product:", np.prod(y))

print("\nRange (difference between maximum and minimum):")
print("Range:", np.ptp(y))

print("\nQuartiles:")
print("Q1, Q2, Q3:", np.percentile(y, [25, 50, 75]))


In [None]:
console.rule("[cyan]Slicing arrays[/cyan]")
# Slicing with NumPy arrays

c = np.arange(1, 11, 1)
print("c:", c)

print("\nFirst 2 values:")
print(c[:2])

print("\nLast 8 elements:")
print(c[2:])

print("\nLast 2 values:")
print(c[-2:])

print("\nFirst 8 values:")
print(c[:-2])


### Image filters on NumPy arrays

In [None]:
console.rule("[cyan]Filters on arrays (image data)[/cyan]")

import matplotlib.pyplot as plt
import numpy as np
from skimage import data, filters

# Load the sample image
image = data.camera()
print("image shape:", image.shape)

# Create a figure with subplots
fig, ax = plt.subplots(1, 3, figsize=(12, 4))

# Display the original image
ax[0].imshow(image, cmap='viridis')
ax[0].set_title('Original Image')

# Apply Gaussian blur filter
blurred = filters.gaussian(image, sigma=16)
ax[1].imshow(blurred, cmap='gray')
ax[1].set_title('Gaussian Blur Filter')

# Apply Sobel filter for edge detection
edges = filters.sobel(image)
ax[2].imshow(edges, cmap='gray')
ax[2].set_title('Sobel Filter for Edge Detection')

# Adjust the spacing between subplots
plt.tight_layout()

# Display the figure
plt.show()


In [None]:
import matplotlib.pyplot as plt
import numpy as np
from skimage import data, filters

# Load the sample image
image = data.camera()

# Calculate the 50th and 99th percentiles of the image
percentile_50 = np.percentile(image, 50)
percentile_99 = np.percentile(image, 99)

# Clip the image: values below the 50th percentile are raised to it,
# and values above the 99th percentile are lowered to it
thresholded = np.clip(image, percentile_50, percentile_99)

# Create a figure with subplots
fig, ax = plt.subplots(1, 2, figsize=(12, 4))

# Display the original image
ax[0].imshow(image, cmap='gray')
ax[0].set_title('Original Image')

# Display the image with values outside the range clipped
ax[1].imshow(thresholded, cmap='gray')
ax[1].set_title('Clipped Values (50%-99%)')

plt.show()

# Create a new figure with histograms
fig, ax = plt.subplots(1, 2, figsize=(12, 4))
ax[0].hist(image.ravel(), bins=256, histtype='step', color='black')
ax[1].hist(thresholded.ravel(), bins=256, histtype='step', color='black')

# Show the histograms
plt.show()
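
The effect of percentile clipping is easier to verify on a small array than on an image. The following is a minimal sketch with made-up intensity values (the names `pixels`, `low`, and `high` are illustrative). Note that `np.clip` does not remove values: anything outside the bounds is replaced by the nearest bound, so the array keeps its size.

```python
import numpy as np

# Made-up pixel intensities for illustration
pixels = np.array([0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100])

low = np.percentile(pixels, 50)    # median of the values
high = np.percentile(pixels, 90)

# Values below `low` become `low`; values above `high` become `high`
clipped = np.clip(pixels, low, high)

print("low:", low, "high:", high)
print("clipped:", clipped)
print("same size:", clipped.size == pixels.size)
```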


### Pandas
The [User Guide](https://pandas.pydata.org/docs/user_guide/index.html) covers all of pandas by topic area. Each of the subsections introduces a topic (such as “working with missing data”), and discusses how pandas approaches the problem, with many examples throughout.

[API documentation](https://pandas.pydata.org/docs/reference/index.html)
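
As a small taste of one topic the User Guide covers, working with missing data, here is a minimal sketch on a hypothetical toy DataFrame (the `toy` name, its columns, and its values are invented for illustration). It counts missing entries, drops incomplete rows, and fills a numeric gap with the column mean.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing entries
toy = pd.DataFrame({
    "age": [25, np.nan, 40, 35],
    "city": ["Madrid", "Valencia", None, "Sevilla"],
})

# Count missing values per column
missing = toy.isnull().sum()
print(missing)

# Drop rows that contain any missing value
complete = toy.dropna()
print(complete)

# Alternatively, fill the numeric gap with the column mean
filled = toy.assign(age=toy["age"].fillna(toy["age"].mean()))
print(filled)
```

Whether to drop or fill depends on the analysis; filling with the mean is only one of several possible strategies.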

In [None]:
console.rule("[cyan]Read CSV data with Pandas[/cyan]")

import pandas as pd
from tabulate import tabulate # For better visualization of tables


data_file = '/content/sample_data/california_housing_train.csv'

# Read the CSV file and load the data into a DataFrame
df = pd.read_csv(data_file)
df


In [None]:
df = pd.read_csv(data_file)

# Display basic information about the DataFrame
print("Basic information about the DataFrame:")
print(f" - Number of rows: {df.shape[0]}")
print(f" - Number of columns: {df.shape[1]}")

# Display the columns of the DataFrame
print("\nColumns of the DataFrame:")
print(df.columns.tolist())

# Display the first rows of the DataFrame
print("\nFirst 4 rows of the DataFrame:")
print(df.head(4))
print(tabulate(df.head(4)))

In [None]:
# Display a range of rows from the DataFrame (the end index is exclusive)
print("\nRow range (457-458):")
print(df.iloc[457:459])
print("Filter values:", df.loc[df['median_income'] > 4.767000])

df_s = df.loc[df['median_income'] > 4.767000]
print("Rows in the initial df:", df.shape[0])
print("Rows in the selected df:", df_s.shape[0])

In [None]:
# Show detailed information about the DataFrame
print("\nDetailed information of the DataFrame:")
df_s.info()

In [None]:
# Show descriptive statistics of the DataFrame
print("\nDescriptive statistics of the DataFrame:")
print(df.describe())

In [None]:
# Show the data types of the columns
console.rule("Data types of the columns:")
print(df.dtypes)

# Count the null values in each column of the DataFrame
console.rule("Null values in the DataFrame:")
print(df.isnull().sum())

# Show a random sample of 5 rows
console.rule("Random sample of 5 rows:")
print(df.sample(5))


In [None]:
console.rule("[cyan]Other Basic Operations[/cyan]")

# Create a DataFrame from a dictionary
data = {
    'Name': ['Juan', 'María', 'Pedro', 'Ana'],
    'Age': [25, 30, 40, 35],
    'City': ['Madrid', 'Madrid', 'Valencia', 'Sevilla'],
    'Salary': [255000, 24444, 30000, 42222]
}
df = pd.DataFrame(data)

# Show the last rows of the DataFrame
print("Last rows:\n", df.tail(2))

# Filter the DataFrame based on a condition
filtered_df = df[df['Age'] > 25]
print("\nFiltered by age:\n", filtered_df)

# Sort the DataFrame by a column
sorted_df = df.sort_values(['Age'], ascending=False)
print("\nSorted by age:\n", sorted_df)


In [None]:
console.rule("[cyan]Basic Operations with Pandas DataFrames[/cyan]")

# Select specific columns
selected_columns = df[['Name', 'City']]
print("\n\nSelection of 'Name' and 'City' columns:")
print(selected_columns)

# Add a new column
df['Years_Experience'] = [3, 5, 10, 7]
print("\nDataFrame with new column 'Years_Experience':")
print(df)


In [None]:
# Calculate statistics by group
group_stats = df.groupby('City')['Salary'].mean()
print("\n\nAverage salary by city:")
print(group_stats)

# Rename columns
df = df.rename(columns={'Years_Experience': 'Experience'})
print("\n\nDataFrame with renamed column:")
print(df)

# Add a new calculated column to the DataFrame
df['Double_Age'] = df['Age'] * 2
print("\nDataFrame with additional column:\n", df)


In [None]:
# Count how many times each value appears in a column
counts = df['City'].value_counts()
print("\nValue counts:\n", counts)

# Calculate the mean of a column
mean_age = df['Age'].mean()
print("\nMean age:", mean_age)

# Remove a column from the DataFrame
console.rule("Remove a column from the DataFrame")
df = df.drop('Double_Age', axis=1)
print("\nDataFrame without the Double_Age column:\n", df)


### Seaborn

Seaborn is a data visualization library based on Matplotlib that provides a high-level interface to create attractive and informative graphics. It is especially designed to work with pandas DataFrames and integrates well with Python's data analysis capabilities.

Link to Seaborn's official documentation: [Seaborn Documentation](https://seaborn.pydata.org/)

In [None]:
console.rule("[cyan]Plots with Seaborn[/cyan]")

import seaborn as sns

# Load the "iris" dataset
iris = sns.load_dataset('iris')
print("IRIS dataset:\n", iris)
print("IRIS columns:", iris.columns)
print("Species:", np.unique(iris['species'].values))

In [None]:
# Distribution plot (histogram and KDE)
sns.histplot(iris['sepal_length'], kde=True)

mean = iris['sepal_length'].mean()
std = iris['sepal_length'].std()
plt.title(f'Distribution of Sepal Length: Mean = {round(mean,2)} Std = {round(std,2)}')
plt.xlabel('Sepal Length')
plt.ylabel('Counts')
plt.show()

In [None]:
# Scatter plot
sns.scatterplot(x='sepal_length', y='sepal_width', hue='species', data=iris)
plt.title('Relation between Sepal Length and Sepal Width')
plt.xlabel('Sepal Length')
plt.ylabel('Sepal Width')
plt.show()

In [None]:
# Bar plot
sns.barplot(x='species', y='petal_length', data=iris, color="cyan")
plt.title('Average Petal Length by Species')
plt.xlabel('Species')
plt.ylabel('Average Petal Length')
plt.show()

In [None]:
# Box plot
sns.boxplot(data=iris)
plt.title('Distribution of Iris Measurements by Variable')
plt.xlabel('Variable')
plt.ylabel('Length/Width [cm]')
plt.show()