# <span style="color:#323232"> Introduction to Data Analysis with NumPy, Pandas, and Matplotlib </span>

## <span style="color:#323232"> Data analysis </span>

<span style="color:#323232; font-size:18px; line-height:35px; font-family: Calibri;">Data analysis is the process of inspecting, cleaning, transforming, and modeling data in order to discover useful information, suggest conclusions, and support decision-making. </span>

### Steps for analyzing, manipulating, and visualizing data:
1. Transform the raw data into the desired format
2. Clean the transformed data (steps 1 and 2 together are called **data preprocessing**)
3. Prepare a model
4. Analyze trends and make decisions
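
The steps above can be sketched with Pandas on a tiny table. This is only an illustration under stated assumptions: the table, its column names (`city`, `temp_c`), and its values are invented for the example, and a plain mean stands in for a real model.

```python
import pandas as pd

# Hypothetical raw data: numbers stored as text, one duplicated row, one missing value
raw = pd.DataFrame({
    "city": ["Lima", "Quito", "Quito", "Bogota"],
    "temp_c": ["18.5", "14.0", "14.0", None],
})

# Step 1: transform the raw data into the desired format (text -> floats)
transformed = raw.assign(temp_c=pd.to_numeric(raw["temp_c"], errors="coerce"))

# Step 2: clean the transformed data (drop duplicated rows and missing values)
clean = transformed.drop_duplicates().dropna().reset_index(drop=True)

# Steps 3 and 4: a trivial stand-in "model" (the mean) used to summarize the data
mean_temp = clean["temp_c"].mean()

print(clean)
print(mean_temp)
```

In a real project each step is usually far more involved, but the order (transform, clean, model, interpret) stays the same.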

<span style="color:#1D4289; font-size:18px; line-height:35px; font-family: Calibri;">To analyze data we will use the ``NumPy``, ``Pandas``, ``Matplotlib``, and ``Seaborn`` libraries. </span>

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns
%matplotlib inline

### <span style="color:#D3273E">Reading and writing data from files </span>

In [None]:
# Read data from a CSV file into a NumPy array
arr_csv = np.genfromtxt('Hurricanes.csv', delimiter = ',')
# Write the array back out to a new CSV file
np.savetxt('newfilex.csv', arr_csv, delimiter = ',')
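
One detail worth knowing about `np.genfromtxt`: by default it tries to parse every field as a float, so a text header row comes back as `nan` values. The sketch below shows this on an in-memory CSV. The column names and values are made up for the example and are not taken from `Hurricanes.csv`.

```python
import io
import numpy as np

# Hypothetical CSV content with a header row (not the real Hurricanes.csv)
csv_text = "year,wind\n2005,150\n2006,120\n"

# Default behaviour: the header row cannot be parsed as floats and becomes nan
with_header = np.genfromtxt(io.StringIO(csv_text), delimiter=",")

# skip_header=1 drops the first line before parsing
no_header = np.genfromtxt(io.StringIO(csv_text), delimiter=",", skip_header=1)

print(with_header)
print(no_header)
```

If the file being read has a header line, passing `skip_header=1` keeps the `nan` row out of the array.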

In [None]:
# Read data from a CSV file into a Pandas DataFrame
table_csv = pd.read_csv('Cars2015.csv')

In [None]:
# Write the DataFrame to a new CSV file
table_csv.to_csv('newcars2015.csv')
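
By default `to_csv` also writes the row index, which shows up as an extra unnamed column when the file is read back. The sketch below compares both behaviours using an in-memory buffer. The small table is a hypothetical stand-in, since the actual columns of `Cars2015.csv` are not shown here.

```python
import io
import pandas as pd

# Hypothetical stand-in table (not the real contents of Cars2015.csv)
cars = pd.DataFrame({"model": ["A", "B"], "mpg": [30.5, 22.0]})

# Default: the row index is written as the first column
with_index = io.StringIO()
cars.to_csv(with_index)
with_index.seek(0)
reread = pd.read_csv(with_index)

# index=False leaves the row index out, so the columns round-trip unchanged
without_index = io.StringIO()
cars.to_csv(without_index, index=False)
without_index.seek(0)
restored = pd.read_csv(without_index)

print(list(reread.columns))
print(list(restored.columns))
```

Whether to keep the index depends on the data: a meaningful index (dates, IDs) is worth writing, while a default 0..n-1 index usually is not.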

## <span style="color:#D3273E"> Matplotlib </span>

1. Matplotlib is a Python library designed for creating plots, charts, and similar figures, with the aim of providing interactive data visualization.
2. Matplotlib is inspired by MATLAB and reproduces many of its features.

#### <span style="color:#1D4289"> Simple plots </span>

In [None]:
plt.plot([1,2,3,4])
plt.show()

In [None]:
x = [x/4 for x in range(21)]
plt.plot(x, [x1**2 for x1 in x])
plt.show()

In [None]:
x = np.arange(0, 5, 0.01)
plt.plot(x, x**2)  # NumPy arrays support element-wise arithmetic
plt.show()

#### <span style="color:#1D4289"> Multiple plots </span>

In [None]:
plt.style.use('default')
x = [x/4 for x in range(21)]
plt.plot(x, [x1 for x1 in x])
plt.plot(x, [x1*x1 for x1 in x])
plt.plot(x, [x1*x1*x1 for x1 in x])
plt.show()

In [None]:
x = [x/4 for x in range(21)]
plt.plot(x, [x1 for x1 in x],
         x, [x1*x1 for x1 in x],
         x, [x1*x1*x1 for x1 in x])
plt.show()

#### <span style="color:#1D4289"> Grids </span>

In [None]:
x = [x/4 for x in range(21)]
plt.plot(x, [x1 for x1 in x],
         x, [x1*2 for x1 in x],
         x, [x1*4 for x1 in x])
plt.grid(True)
plt.show()

#### <span style="color:#1D4289"> Axis limits </span>

In [None]:
x = range(5)
plt.plot(x, [x1 for x1 in x],
         x, [x1*2 for x1 in x],
         x, [x1*4 for x1 in x])
plt.grid(True)
plt.axis([-1, 5, -1, 10]) # Sets new axes limits
plt.show()

In [None]:
x = range(5)
plt.plot(x, [x1 for x1 in x],
         x, [x1*2 for x1 in x],
         x, [x1*4 for x1 in x])
plt.grid(True)
plt.xlim(-1, 5)
plt.ylim(-1, 10)
plt.show()

#### <span style="color:#1D4289"> Labels </span>

In [None]:
x = range(5)
plt.plot(x, [x1 for x1 in x],
         x, [x1*2 for x1 in x],
         x, [x1*4 for x1 in x])
plt.grid(True)
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.show()

#### <span style="color:#1D4289"> Titles </span>

In [None]:
x = range(5)
plt.plot(x, [x1 for x1 in x],
         x, [x1*2 for x1 in x],
         x, [x1*4 for x1 in x])
plt.grid(True)
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title("Polynomial Graph")
plt.show()

#### <span style="color:#1D4289"> Legends </span>

In [None]:
x = np.arange(5)
plt.plot(x, x, label = 'linear')
plt.plot(x, x*x, label = 'square')
plt.plot(x, x*x*x, label = 'cube')
plt.grid(True)
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title("Polynomial Graph")
plt.legend()
plt.show()

#### <span style="color:#1D4289"> Markers </span>

In [None]:
x = [1, 2, 3, 4, 5, 6]
y = [11, 22, 33, 44, 55, 66]
plt.plot(x, y, 'bo')
for i in range(len(x)):
    x_cord = x[i]
    y_cord = y[i]
    plt.text(x_cord, y_cord, (x_cord, y_cord), fontsize = 10)
plt.show()

#### <span style="color:#1D4289"> Saving plots </span>

In [None]:
x = np.arange(5)
plt.plot(x, x, label = 'linear')
plt.plot(x, x*x, label = 'square')
plt.plot(x, x*x*x, label = 'cube')
plt.grid(True)
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title("Polynomial Graph")
plt.legend()
plt.savefig('plot.png') # Saves an image named 'plot.png' in the current directory
plt.show()

#### <span style="color:#1D4289"> Histogram </span>

In [None]:
y = np.random.randn(100, 100)
plt.hist(y)
plt.show()

In [None]:
y = np.random.randn(1000)
plt.hist(y, 100)
plt.show()

#### <span style="color:#1D4289"> Bar charts </span>

In [None]:
plt.bar([1,2,3], [1,4,9])
plt.show()

In [None]:
dictionary = {'A':25, 'B':70, 'C':55, 'D':90}
for i, key in enumerate(dictionary):
    plt.bar(i, dictionary[key])
plt.show()

In [None]:
dictionary = {'A':25, 'B':70, 'C':55, 'D':90}
for i, key in enumerate(dictionary):
    plt.bar(i, dictionary[key])
plt.xticks(np.arange(len(dictionary)), dictionary.keys())
plt.show()

#### <span style="color:#1D4289"> Pie chart </span>

In [None]:
plt.figure(figsize = (3,3))
x = [40, 20, 5]
labels = ['Bikes', 'Cars', 'Buses']
plt.pie(x, labels = labels)
plt.show()

#### <span style="color:#1D4289"> Scatter plot </span>

In [None]:
x = np.random.rand(1000)
y = np.random.rand(1000)
plt.scatter(x, y)
plt.show()

#### <span style="color:#1D4289"> Customization </span>

In [None]:
y = np.arange(1, 3)
plt.plot(y, 'y')
plt.plot(y+5, 'm')
plt.plot(y+10, 'c')
plt.show()

In [None]:
y = np.arange(1, 100)
plt.plot(y, '--', y*5, '-.', y*10, ':')
plt.show()

In [None]:
y = np.arange(1, 3, 0.2)
plt.plot(y, '*',
        y+0.5, 'o',
        y+1, 'D',
        y+2, '^',
        y+3, 's')
plt.show()

In [None]:
matplotlib.style.available

In [None]:
import matplotlib.pyplot as plt
import numpy as np

import matplotlib.colors as mcolors
from matplotlib.patches import Rectangle

np.random.seed(19680801)


def plot_scatter(ax, prng, nb_samples=100):
    """Scatter plot."""
    for mu, sigma, marker in [(-.5, 0.75, 'o'), (0.75, 1., 's')]:
        x, y = prng.normal(loc=mu, scale=sigma, size=(2, nb_samples))
        ax.plot(x, y, ls='none', marker=marker)
    ax.set_xlabel('X-label')
    ax.set_title('Axes title')
    return ax


def plot_colored_lines(ax):
    """Plot lines with colors following the style color cycle."""
    t = np.linspace(-10, 10, 100)

    def sigmoid(t, t0):
        return 1 / (1 + np.exp(-(t - t0)))

    nb_colors = len(plt.rcParams['axes.prop_cycle'])
    shifts = np.linspace(-5, 5, nb_colors)
    amplitudes = np.linspace(1, 1.5, nb_colors)
    for t0, a in zip(shifts, amplitudes):
        ax.plot(t, a * sigmoid(t, t0), '-')
    ax.set_xlim(-10, 10)
    return ax


def plot_bar_graphs(ax, prng, min_value=5, max_value=25, nb_samples=5):
    """Plot two bar graphs side by side, with letters as x-tick labels."""
    x = np.arange(nb_samples)
    ya, yb = prng.randint(min_value, max_value, size=(2, nb_samples))
    width = 0.25
    ax.bar(x, ya, width)
    ax.bar(x + width, yb, width, color='C2')
    ax.set_xticks(x + width, labels=['a', 'b', 'c', 'd', 'e'])
    return ax


def plot_colored_circles(ax, prng, nb_samples=15):
    """
    Plot circle patches.

    NB: draws a fixed amount of samples, rather than using the length of
    the color cycle, because different styles may have different numbers
    of colors.
    """
    for sty_dict, j in zip(plt.rcParams['axes.prop_cycle'](),
                           range(nb_samples)):
        ax.add_patch(plt.Circle(prng.normal(scale=3, size=2),
                                radius=1.0, color=sty_dict['color']))
    ax.grid(visible=True)

    plt.title('ax.grid(True)', family='monospace', fontsize='small')

    ax.set_xlim([-4, 8])
    ax.set_ylim([-5, 6])
    ax.set_aspect('equal', adjustable='box')  # to plot circles as circles
    return ax


def plot_image_and_patch(ax, prng, size=(20, 20)):
    """Plot an image with random values and superimpose a circular patch."""
    values = prng.random_sample(size=size)
    ax.imshow(values, interpolation='none')
    c = plt.Circle((5, 5), radius=5, label='patch')
    ax.add_patch(c)
    # Remove ticks
    ax.set_xticks([])
    ax.set_yticks([])


def plot_histograms(ax, prng, nb_samples=10000):
    """Plot 4 histograms and a text annotation."""
    params = ((10, 10), (4, 12), (50, 12), (6, 55))
    for a, b in params:
        values = prng.beta(a, b, size=nb_samples)
        ax.hist(values, histtype="stepfilled", bins=30,
                alpha=0.8, density=True)

    ax.annotate('Annotation', xy=(0.25, 4.25),
                xytext=(0.9, 0.9), textcoords=ax.transAxes,
                va="top", ha="right",
                bbox=dict(boxstyle="round", alpha=0.2),
                arrowprops=dict(
                          arrowstyle="->",
                          connectionstyle="angle,angleA=-95,angleB=35,rad=10"),
                )
    return ax


def plot_figure(style_label=""):
    """Setup and plot the demonstration figure with a given style."""
    prng = np.random.RandomState(96917002)

    fig, axs = plt.subplots(ncols=6, nrows=1, num=style_label,
                            figsize=(14.8, 2.8), layout='constrained')

    background_color = mcolors.rgb_to_hsv(
        mcolors.to_rgb(plt.rcParams['figure.facecolor']))[2]
    if background_color < 0.5:
        title_color = [0.8, 0.8, 1]
    else:
        title_color = np.array([19, 6, 84]) / 256
    fig.suptitle(style_label, x=0.01, ha='left', color=title_color,
                 fontsize=14, fontfamily='DejaVu Sans', fontweight='normal')

    plot_scatter(axs[0], prng)
    plot_image_and_patch(axs[1], prng)
    plot_bar_graphs(axs[2], prng)
    plot_colored_lines(axs[3])
    plot_histograms(axs[4], prng)
    plot_colored_circles(axs[5], prng)

    rec = Rectangle((1 + 0.025, -2), 0.05, 16,
                    clip_on=False, color='gray')

    axs[4].add_artist(rec)

if __name__ == "__main__":

    style_list = ['default', 'classic'] + sorted(
        style for style in plt.style.available
        if style != 'classic' and not style.startswith('_'))

    for style_label in style_list:
        with plt.rc_context({"figure.max_open_warning": len(style_list)}):
            with plt.style.context(style_label):
                plot_figure(style_label=style_label)

    plt.show()

## <span style="color:#D3273E"> Seaborn </span>

An easy way to make your plots look nicer is to use some of the default styles from the **Seaborn** library. These can be applied globally with the ``sns.set_style`` function. The full list of predefined styles is given in the Seaborn documentation.
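
As a small sketch of how these styles can be inspected and applied, the block below uses `sns.axes_style`, which returns a style's parameters without applying them and can also act as a context manager that applies the style only temporarily. The `Agg` backend and the three data points are assumptions made so the example runs on its own; neither is needed inside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import seaborn as sns

# The five style names accepted by sns.set_style
style_names = ["darkgrid", "whitegrid", "dark", "white", "ticks"]

# sns.axes_style returns the parameters of a style without applying them
grid_by_style = {name: sns.axes_style(name)["axes.grid"] for name in style_names}
print(grid_by_style)

# Used as a context manager, the style applies only inside the with-block
with sns.axes_style("whitegrid"):
    fig, ax = plt.subplots()
    ax.plot([2000, 2001, 2002], [0.895, 0.91, 0.919], "s-b")  # illustrative values
plt.close(fig)
```

The context-manager form is handy when only one figure should use a given style, whereas `sns.set_style` changes the style for every plot that follows.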

In [None]:
years = range(2000, 2012)
apples = [0.895, 0.91, 0.919, 0.926, 0.929, 0.931, 0.934, 0.936, 0.937, 0.9375, 0.9372, 0.939]
oranges = [0.962, 0.941, 0.930, 0.923, 0.918, 0.908, 0.907, 0.904, 0.901, 0.898, 0.9, 0.896]

In [None]:
sns.set_style("whitegrid")
plt.plot(years, apples, 's-b')
plt.plot(years, oranges, 'o--r')

plt.xlabel('Year')
plt.ylabel('Yield (tons per hectare)')

plt.title("Crop Yields in Kanto")
plt.legend(['Apples', 'Oranges']);

In [None]:
sns.set_style("darkgrid")
plt.plot(years, apples, 's-b')
plt.plot(years, oranges, 'o--r')

plt.xlabel('Year')
plt.ylabel('Yield (tons per hectare)')

plt.title("Crop Yields in Kanto")
plt.legend(['Apples', 'Oranges']);

#### <span style="color:#1D4289"> Grouped scatter plot </span>

In [None]:
flowers_df = sns.load_dataset("iris")
sns.scatterplot(x=flowers_df.sepal_length, 
                y=flowers_df.sepal_width, 
                hue=flowers_df.species, 
                s=100);

In [None]:
plt.figure(figsize=(12, 6))
plt.title('Sepal Dimensions')

sns.scatterplot(x=flowers_df.sepal_length, 
                y=flowers_df.sepal_width, 
                hue=flowers_df.species,
                s=100);

In [None]:
plt.title('Petal Dimensions')
sns.scatterplot(x='petal_length', 
                y='petal_width', 
                hue='species',
                s=100,
                data=flowers_df);

#### <span style="color:#1D4289"> Grouped histogram </span>

In [None]:
setosa_df = flowers_df[flowers_df.species == 'setosa']
versicolor_df = flowers_df[flowers_df.species == 'versicolor']
virginica_df = flowers_df[flowers_df.species == 'virginica']
plt.hist(setosa_df.sepal_width, alpha=0.4, bins=np.arange(2, 5, 0.25));
plt.hist(versicolor_df.sepal_width, alpha=0.4, bins=np.arange(2, 5, 0.25));
plt.hist(virginica_df.sepal_width, alpha=0.4, bins=np.arange(2, 5, 0.25));

#### <span style="color:#1D4289"> Bar charts (enhanced) </span>

In [None]:
tips_df = sns.load_dataset("tips");
sns.barplot(x='day', y='total_bill', data=tips_df);

In [None]:
sns.barplot(x='day', y='total_bill', hue='sex', data=tips_df);

In [None]:
sns.barplot(x='total_bill', y='day', hue='sex', data=tips_df);

#### <span style="color:#1D4289"> Heat maps </span>

In [None]:
flights_df = sns.load_dataset("flights").pivot(index = 'month', 
                                               columns = 'year', 
                                               values = 'passengers')
flights_df.head(5)

In [None]:
plt.title("No. of passengers (1000s)")
sns.heatmap(flights_df);

In [None]:
sns.heatmap(flights_df, fmt="d", annot=True, cmap='Blues');

#### <span style="color:#1D4289"> Time series plot with error bands </span>

In [None]:
fmri = sns.load_dataset("fmri")
sns.lineplot(x="timepoint", y="signal",
             hue="region", style="event",
             data=fmri)

#### <span style="color:#1D4289"> Multiple plots in one figure </span>

In [None]:
fig, axes = plt.subplots(2, 3, figsize=(16, 8))

axes[0,0].plot(years, apples, 's-b')
axes[0,0].plot(years, oranges, 'o--r')
axes[0,0].set_xlabel('Year')
axes[0,0].set_ylabel('Yield (tons per hectare)')
axes[0,0].legend(['Apples', 'Oranges']);
axes[0,0].set_title('Crop Yields in Kanto')

axes[0,1].set_title('Sepal Length vs. Sepal Width')
sns.scatterplot(x=flowers_df.sepal_length, 
                y=flowers_df.sepal_width, 
                hue=flowers_df.species, 
                s=100, 
                ax=axes[0,1]);

axes[0,2].set_title('Distribution of Sepal Width')
axes[0,2].hist([setosa_df.sepal_width, versicolor_df.sepal_width, virginica_df.sepal_width], 
         bins=np.arange(2, 5, 0.25), 
         stacked=True);

axes[0,2].legend(['Setosa', 'Versicolor', 'Virginica']);

axes[1,0].set_title('Restaurant bills')
sns.barplot(x='day', y='total_bill', hue='sex', data=tips_df, ax=axes[1,0]);

axes[1,1].set_title('Flight traffic')
sns.heatmap(flights_df, cmap='Blues', ax=axes[1,1]);

axes[1,2].set_title('fMRI responses')
# Pass the target axes explicitly, as with the other Seaborn plots in this figure
sns.lineplot(x="timepoint", y="signal",
             hue="region", style="event",
             data=fmri, ax=axes[1,2])

plt.tight_layout(pad=2);

#### <span style="color:#1D4289"> Pair plots </span>

In [None]:
sns.pairplot(flowers_df, hue='species');

In [None]:
sns.pairplot(tips_df, hue='sex');