# Challenge 1 - Preliminary Concepts for Big Data
- Constanza Córdova

## Exercise 1: Artificial Data Generation

+ The function `create_random_row` is defined below; it generates an artificial record of a customer at an insurance company:

```python
import random

def create_random_row():
    # simulate the age column
    age = random.randint(18, 90)
    # simulate the income column
    income = random.randrange(10000, 1000000, step=1000)
    # simulate the employment status
    employment_status = random.choice(['Unemployed', 'Employed'])
    # simulate whether the customer has debt or not
    debt_status = random.choice(['Debt', 'No Debt'])
    # simulate whether the customer recently churned or not
    churn_status = random.choice(['Churn', 'No Churn'])
    return age, income, employment_status, debt_status, churn_status

# execution
create_random_row()
```
+ Run the function 10 million times and store the result in an object.

Some assumptions:
* Assume, from here on, that the generated data represent empirical measurements of customer behavior at the insurance company.
* Consider the following working environment on your computer: the Anaconda distribution is not installed, so you will not have access to the `pandas`, `numpy` and `scipy` libraries. You also lack user permissions, so you cannot install them. You can only use native Python functions.
* Since your code may later be used in an internal web application built in Scala, you must use vectorized operations such as `map`, `filter`, `reduce`, and list comprehensions.

### SOLUTION:

In [1]:
import random

def create_random_row():
    age = random.randint(18, 90)
    income = random.randrange(10000, 1000000, step=1000)
    employment_status = random.choice(['Unemployed', 'Employed'])
    debt_status = random.choice(['Debt', 'No Debt'])
    churn_status = random.choice(['Churn', 'No Churn'])
    return age, income, employment_status, debt_status, churn_status



In [2]:
list(create_random_row())

[30, 492000, 'Employed', 'Debt', 'Churn']

In [3]:
mi_l = [x for x in range(5)]
print(mi_l)

mi_l2 = []
for x in range(5):
    mi_l2.append(x)
print(mi_l2)

[0, 1, 2, 3, 4]
[0, 1, 2, 3, 4]


In [4]:
# set the pseudo-random seed. Not that relevant in this case.
random.seed(11238)

# Generate the range of observations (1 million here; the exercise statement asks for 10 million)
value_range = range(1_000_000)

# using a list comprehension, generate a list of lists that simulates the records
random_database = [list(create_random_row()) for i in value_range]

In [5]:
len(random_database)

1000000

In [6]:
random_database[100]

[69, 117000, 'Unemployed', 'Debt', 'No Churn']

In [7]:
# Using map

def int_to_row(x):
    return list(create_random_row())

random_database = list(map(int_to_row, value_range))


In [8]:
#len(random_database)

[x for x in range(10)]

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
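
The seed matters whenever the generated data have to be reproducible. The following is a minimal sketch, assuming a hypothetical helper `build_database(n, seed)` that is not part of the original notebook, and a small `n` so that it runs quickly. It reseeds the generator before building the rows with `map`, so two runs with the same seed should produce identical data.

```python
import random

def create_random_row():
    age = random.randint(18, 90)
    income = random.randrange(10000, 1000000, 1000)
    employment_status = random.choice(['Unemployed', 'Employed'])
    debt_status = random.choice(['Debt', 'No Debt'])
    churn_status = random.choice(['Churn', 'No Churn'])
    return age, income, employment_status, debt_status, churn_status

def build_database(n, seed):
    # hypothetical helper: reseed the generator, then build n rows with map
    random.seed(seed)
    return list(map(lambda _: list(create_random_row()), range(n)))

first_run = build_database(1000, seed=11238)
second_run = build_database(1000, seed=11238)

print(first_run == second_run)  # True
```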

## Exercise 2
The research management office of the insurance company asks you to improve the following code:

```python
employment_income_looped = 0
for i in random_database:
    if i[2] == 'Employed':
        employment_income_looped += i[1]
# returns
#2523162067000
```
Answer the following points:
+ What will the variable employment_income_looped return?
+ What would an implementation of the code using map and filter look like?
+ Are the results the same?

### SOLUTION:

The variable employment_income_looped holds the sum of income over all records whose employment_status equals 'Employed'.

In [9]:
employment_income_looped = 0
for i in random_database:
    if i[2] == 'Employed':
        employment_income_looped += i[1]
        
print(employment_income_looped)

252330389000


Implementation with map and filter

In [10]:
# Use filter to keep only the 'Employed' cases
list_employed = list(filter(lambda x: x[2] == 'Employed', random_database))
list_employed[:10]

[[49, 280000, 'Employed', 'Debt', 'No Churn'],
 [53, 320000, 'Employed', 'Debt', 'No Churn'],
 [45, 275000, 'Employed', 'No Debt', 'Churn'],
 [42, 421000, 'Employed', 'No Debt', 'Churn'],
 [47, 536000, 'Employed', 'Debt', 'Churn'],
 [19, 786000, 'Employed', 'Debt', 'Churn'],
 [43, 697000, 'Employed', 'Debt', 'Churn'],
 [79, 170000, 'Employed', 'Debt', 'Churn'],
 [76, 170000, 'Employed', 'Debt', 'No Churn'],
 [78, 175000, 'Employed', 'Debt', 'No Churn']]

In [11]:
# with map, build a list of all incomes from the previously filtered list
list_income = list(map(lambda x: x[1], list_employed))
list_income[:10]

[280000,
 320000,
 275000,
 421000,
 536000,
 786000,
 697000,
 170000,
 170000,
 175000]

In [12]:
# Sum all the values in the income list created above
employment_income_map = sum(list_income)
employment_income_map

252330389000

Simplified implementation:

In [13]:
sum_income_employed = sum(map(lambda x: x[1], filter(lambda x: x[2] == 'Employed', random_database)))
print(sum_income_employed)

252330389000
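
The assumptions in Exercise 1 also list `reduce`, which the cells above do not use. The following is a minimal sketch of the same sum written with `functools.reduce`. It runs on a small hand-made sample (`sample_database` and its rows are illustrative, not the generated data) and compares the result with the loop from the statement.

```python
from functools import reduce

# illustrative rows: [age, income, employment_status, debt_status, churn_status]
sample_database = [
    [30, 492000, 'Employed', 'Debt', 'Churn'],
    [69, 117000, 'Unemployed', 'Debt', 'No Churn'],
    [49, 280000, 'Employed', 'Debt', 'No Churn'],
    [25, 50000, 'Unemployed', 'No Debt', 'Churn'],
]

# loop version from the exercise statement
employment_income_looped = 0
for i in sample_database:
    if i[2] == 'Employed':
        employment_income_looped += i[1]

# reduce folds the incomes of the rows kept by filter into a running total, starting at 0
employment_income_reduced = reduce(
    lambda total, row: total + row[1],
    filter(lambda row: row[2] == 'Employed', sample_database),
    0,
)

print(employment_income_looped, employment_income_reduced)  # 772000 772000
```

Passing `0` as the initial value keeps the call well defined even when no row passes the filter.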


## Exercise 3:
Management asks you to improve the following code:
```python
count_debts_looped = 0
for i in random_database:
    for j in i:
        if j == 'Debt':
            count_debts_looped += 1
# returns
#5000335
```
Answer the following points:
+ What will the variable count_debts_looped return?
+ What is the algorithmic complexity of the code?
+ What would an implementation of the code using map and filter look like?
+ Are the results of both operations the same?

### SOLUTION:

The code returns the number of records whose debt_status equals 'Debt'.
To do so, it iterates over every list in random_database and, for each list, over every value it contains. Whenever a value equals 'Debt', it adds 1 to the counter.
The algorithmic complexity of the code is O(n·m) for n records with m fields each; since m is fixed at 5, the cost grows linearly with n.

In [14]:
count_debts_looped = 0
for i in random_database:
    for j in i:
        if j == 'Debt':
            count_debts_looped += 1
print(count_debts_looped)

500019
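
To make the cost of the nested loop concrete, the following sketch counts the comparisons it performs on a small illustrative sample. The `comparisons` counter and the rows of `sample_database` are additions for illustration only. With 3 records of 5 fields each, the inner test runs 3 × 5 times.

```python
# illustrative rows: [age, income, employment_status, debt_status, churn_status]
sample_database = [
    [30, 492000, 'Employed', 'Debt', 'Churn'],
    [69, 117000, 'Unemployed', 'Debt', 'No Churn'],
    [45, 275000, 'Employed', 'No Debt', 'Churn'],
]

count_debts_looped = 0
comparisons = 0  # illustration only: how many times the inner test runs
for i in sample_database:
    for j in i:
        comparisons += 1
        if j == 'Debt':
            count_debts_looped += 1

print(comparisons)         # 15
print(count_debts_looped)  # 2
```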


Implementation with map and filter

In [15]:
count_debt = len(list(filter(lambda x: x[3]== 'Debt', random_database)))
print(count_debt)

500019
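
The count can also be written with `map` alone. The following sketch, on illustrative rows rather than the generated data, maps each record to `True` or `False` and sums the result, which avoids building the intermediate filtered list. It relies on `True` counting as 1 when summed.

```python
# illustrative rows: [age, income, employment_status, debt_status, churn_status]
sample_database = [
    [30, 492000, 'Employed', 'Debt', 'Churn'],
    [69, 117000, 'Unemployed', 'Debt', 'No Churn'],
    [45, 275000, 'Employed', 'No Debt', 'Churn'],
]

# filter version, as in the cell above
count_with_filter = len(list(filter(lambda x: x[3] == 'Debt', sample_database)))

# map yields one boolean per record; sum adds them up as 0s and 1s
count_with_map = sum(map(lambda x: x[3] == 'Debt', sample_database))

print(count_with_filter, count_with_map)  # 2 2
```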


## Exercise 4
Management asks you to improve the following code:

```python
churn_subset, no_churn_subset = [], []
for i in random_database:
    for j in i:
        if i == 'Churn':
            churn_subset.append(i)
    for j in i:
        if i == 'No Churn':
            no_churn.append(i)
```
+ What will the variables churn_subset and no_churn_subset return?
+ What is the algorithmic complexity of the code?
+ What would an implementation of the code using map and filter look like?
+ Are the results of both operations the same?
+ Estimate the mean, variance, minimum and maximum of age for both subsets, without using external libraries.

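### SOLUTION:

As written in the statement, the conditions compare the whole record `i` (a list) with a string, so they are never true and both lists stay empty; the second branch also appends to `no_churn`, a name that is never defined. With the comparison corrected to use `j` and the append to use `no_churn_subset`, as in the cell below, the code returns two subsets: one for the records where churn_status equals 'Churn' and another for those where churn_status equals 'No Churn'.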

In [16]:
churn_subset, no_churn_subset = [], []
for i in random_database:
    for j in i:
        if j == 'Churn':
            churn_subset.append(i)
    for j in i:
        if j == 'No Churn':
            no_churn_subset.append(i)

Using filter to obtain the subsets

In [17]:
churn_set = list(filter(lambda x: x[4] == 'Churn', random_database))
no_churn_set = list(filter(lambda x: x[4] == 'No Churn', random_database))
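
To check whether both approaches give the same result, the following sketch runs the corrected loop and the `filter` version on a small illustrative sample (the rows of `sample_database` are assumptions, not the generated data) and compares the resulting lists.

```python
# illustrative rows: [age, income, employment_status, debt_status, churn_status]
sample_database = [
    [30, 492000, 'Employed', 'Debt', 'Churn'],
    [69, 117000, 'Unemployed', 'Debt', 'No Churn'],
    [45, 275000, 'Employed', 'No Debt', 'Churn'],
    [78, 175000, 'Employed', 'Debt', 'No Churn'],
]

# corrected loop version
churn_subset, no_churn_subset = [], []
for i in sample_database:
    for j in i:
        if j == 'Churn':
            churn_subset.append(i)
    for j in i:
        if j == 'No Churn':
            no_churn_subset.append(i)

# filter version
churn_set = list(filter(lambda x: x[4] == 'Churn', sample_database))
no_churn_set = list(filter(lambda x: x[4] == 'No Churn', sample_database))

print(churn_subset == churn_set)        # True
print(no_churn_subset == no_churn_set)  # True
```

Both versions keep the records in their original order, so a direct list comparison is enough here.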

Using map to obtain the list of ages

In [18]:
churn_age = list(map(lambda x: x[0], churn_set))
no_churn_age = list(map(lambda x: x[0], no_churn_set))

Return the mean, variance, maximum and minimum of age

In [19]:
def media(valores):
    return sum(valores)/len(valores)

In [20]:
media_churn = media(churn_age)
media_no_churn = media(no_churn_age)
print('mean age churn', media_churn)
print('mean age no churn', media_no_churn)

mean age churn 53.98087610984183
mean age no churn 54.00687988099665


In [21]:
def varianza(valores, media):
    return sum(list(map(lambda x: (x-media)**2, valores )))/(len(valores)-1)

In [22]:
var_churn = varianza(churn_age, media_churn)
var_no_churn = varianza(no_churn_age, media_no_churn)
print('variance age churn', var_churn)
print('variance age no churn', var_no_churn)

variance age churn 442.7378971374518
variance age no churn 443.679268793223
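
The `statistics` module ships with the Python standard library, so it should be available under the constraints of the exercise. The following sketch cross-checks the hand-written `media` and `varianza` against `statistics.mean` and `statistics.variance`, which also divides by n - 1. The list `ages` holds illustrative values, not the generated data.

```python
import statistics

def media(valores):
    return sum(valores) / len(valores)

def varianza(valores, media):
    # sample variance: squared deviations divided by n - 1
    return sum(map(lambda x: (x - media) ** 2, valores)) / (len(valores) - 1)

ages = [30, 69, 45, 78, 18, 90]  # illustrative values

mean_by_hand = media(ages)
var_by_hand = varianza(ages, mean_by_hand)

print(mean_by_hand, statistics.mean(ages))
print(var_by_hand, statistics.variance(ages))
```

For the population variance (dividing by n), the standard library offers `statistics.pvariance` instead.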


In [24]:
print('Minimum age churn', min(churn_age))
print('Minimum age no churn', min(no_churn_age))

Minimum age churn 18
Minimum age no churn 18


In [25]:
print('Maximum age churn', max(churn_age))
print('Maximum age no churn', max(no_churn_age))

Maximum age churn 90
Maximum age no churn 90


## Exercise 5:
Management asks you to improve the following code:
```python
unemployed_debt_churn = 0
unemployed_nodebt_churn = 0
unemployed_debt_nochurn = 0
unemployed_nodebt_nochurn = 0
employed_debt_churn = 0
employed_nodebt_churn = 0
employed_debt_nochurn = 0
employed_nodebt_nochurn = 0

for i in random_database:
    if i[2] == 'Unemployed' and i[3] == 'Debt' and i[4] == 'Churn':
        unemployed_debt_churn += 1
    if i[2] == 'Unemployed' and i[3] == 'No Debt' and i[4] == 'Churn':
        unemployed_nodebt_churn += 1
    if i[2] == 'Unemployed' and i[3] == 'Debt' and i[4] == 'No Churn':
        unemployed_debt_nochurn += 1
    if i[2] == 'Unemployed' and i[3] == 'No Debt' and i[4] == 'No Churn':
        unemployed_nodebt_nochurn += 1
    if i[2] == 'Employed' and i[3] == 'Debt' and i[4] == 'Churn':
        employed_debt_churn += 1
    if i[2] == 'Employed' and i[3] == 'No Debt' and i[4] == 'Churn':
        employed_nodebt_churn += 1
    if i[2] == 'Employed' and i[3] == 'Debt' and i[4] == 'No Churn':
        employed_debt_nochurn += 1
    if i[2] == 'Employed' and i[3] == 'No Debt' and i[4] == 'No Churn':
        employed_nodebt_nochurn += 1

print("Unemployed, Debt, Churn: ", unemployed_debt_churn)
print("Unemployed, No Debt, Churn: ", unemployed_nodebt_churn)
print("Unemployed, Debt, No Churn: ", unemployed_debt_nochurn)
print("Unemployed, No Debt, No Churn: ", unemployed_nodebt_nochurn)
print("Employed, Debt, Churn: ", employed_debt_churn)
print("Employed, No Debt, Churn: ", employed_nodebt_churn)
print("Employed, Debt, No Churn: ", employed_debt_nochurn)
print("Employed, No Debt, No Churn: ", employed_nodebt_nochurn)

# returns
# Unemployed, Debt, Churn: 1249114
# Unemployed, No Debt, Churn: 1250165
# Unemployed, Debt, No Churn: 1251163
# Unemployed, No Debt, No Churn: 1249760
# Employed, Debt, Churn: 1249421
# Employed, No Debt, Churn: 1250581
# Employed, Debt, No Churn: 1248184
# Employed, No Debt, No Churn: 1251612
```    
+ What would an implementation using map look like?
+ Are the results of both operations the same?

In [26]:
unemployed_debt_churn = 0
unemployed_nodebt_churn = 0
unemployed_debt_nochurn = 0
unemployed_nodebt_nochurn = 0
employed_debt_churn = 0
employed_nodebt_churn = 0
employed_debt_nochurn = 0
employed_nodebt_nochurn = 0

for i in random_database:
    if i[2] == 'Unemployed' and i[3] == 'Debt' and i[4] == 'Churn':
        unemployed_debt_churn += 1
    if i[2] == 'Unemployed' and i[3] == 'No Debt' and i[4] == 'Churn':
        unemployed_nodebt_churn += 1
    if i[2] == 'Unemployed' and i[3] == 'Debt' and i[4] == 'No Churn':
        unemployed_debt_nochurn += 1
    if i[2] == 'Unemployed' and i[3] == 'No Debt' and i[4] == 'No Churn':
        unemployed_nodebt_nochurn += 1
    if i[2] == 'Employed' and i[3] == 'Debt' and i[4] == 'Churn':
        employed_debt_churn += 1
    if i[2] == 'Employed' and i[3] == 'No Debt' and i[4] == 'Churn':
        employed_nodebt_churn += 1
    if i[2] == 'Employed' and i[3] == 'Debt' and i[4] == 'No Churn':
        employed_debt_nochurn += 1
    if i[2] == 'Employed' and i[3] == 'No Debt' and i[4] == 'No Churn':
        employed_nodebt_nochurn += 1

print("Unemployed, Debt, Churn: ", unemployed_debt_churn)
print("Unemployed, No Debt, Churn: ", unemployed_nodebt_churn)
print("Unemployed, Debt, No Churn: ", unemployed_debt_nochurn)
print("Unemployed, No Debt, No Churn: ", unemployed_nodebt_nochurn)
print("Employed, Debt, Churn: ", employed_debt_churn)
print("Employed, No Debt, Churn: ", employed_nodebt_churn)
print("Employed, Debt, No Churn: ", employed_debt_nochurn)
print("Employed, No Debt, No Churn: ", employed_nodebt_nochurn)

Unemployed, Debt, Churn:  125147
Unemployed, No Debt, Churn:  125063
Unemployed, Debt, No Churn:  125032
Unemployed, No Debt, No Churn:  124878
Employed, Debt, Churn:  124941
Employed, No Debt, Churn:  124695
Employed, Debt, No Churn:  124899
Employed, No Debt, No Churn:  125345


In [40]:
unemployed_debt_churn_count = len(list(filter(lambda x: x[2]=='Unemployed' and x[3] == 'Debt' and x[4] == 'Churn', random_database)))
unemployed_nodebt_churn_count = len(list(filter(lambda x: x[2]=='Unemployed' and x[3] == 'No Debt' and x[4] == 'Churn', random_database)))
unemployed_debt_nochurn_count = len(list(filter(lambda x: x[2]=='Unemployed' and x[3] == 'Debt' and x[4] == 'No Churn', random_database)))
unemployed_nodebt_nochurn_count = len(list(filter(lambda x: x[2]=='Unemployed' and x[3] == 'No Debt' and x[4] == 'No Churn', random_database)))
employed_debt_churn_count = len(list(filter(lambda x: x[2]=='Employed' and x[3] == 'Debt' and x[4] == 'Churn', random_database)))
employed_nodebt_churn_count = len(list(filter(lambda x: x[2]=='Employed' and x[3] == 'No Debt' and x[4] == 'Churn', random_database)))
employed_debt_nochurn_count = len(list(filter(lambda x: x[2]=='Employed' and x[3] == 'Debt' and x[4] == 'No Churn', random_database)))
employed_nodebt_nochurn_count = len(list(filter(lambda x: x[2]=='Employed' and x[3] == 'No Debt' and x[4] == 'No Churn', random_database)))

print("Unemployed, Debt, Churn: ", unemployed_debt_churn_count)
print("Unemployed, No Debt, Churn: ", unemployed_nodebt_churn_count)
print("Unemployed, Debt, No Churn: ", unemployed_debt_nochurn_count)
print("Unemployed, No Debt, No Churn: ", unemployed_nodebt_nochurn_count)
print("Employed, Debt, Churn: ", employed_debt_churn_count)
print("Employed, No Debt, Churn: ", employed_nodebt_churn_count)
print("Employed, Debt, No Churn: ", employed_debt_nochurn_count)
print("Employed, No Debt, No Churn: ", employed_nodebt_nochurn_count)

Unemployed, Debt, Churn:  125147
Unemployed, No Debt, Churn:  125063
Unemployed, Debt, No Churn:  125032
Unemployed, No Debt, No Churn:  124878
Employed, Debt, Churn:  124941
Employed, No Debt, Churn:  124695
Employed, Debt, No Churn:  124899
Employed, No Debt, No Churn:  125345
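
The eight `filter` calls above each traverse the whole database. As one possible alternative that uses `map`, the following sketch maps every record to its (employment, debt, churn) triple and tallies the triples in a single pass with `collections.Counter` from the standard library. The rows of `sample_database` and the name `combination_counts` are illustrative assumptions.

```python
from collections import Counter

# illustrative rows: [age, income, employment_status, debt_status, churn_status]
sample_database = [
    [30, 492000, 'Employed', 'Debt', 'Churn'],
    [69, 117000, 'Unemployed', 'Debt', 'No Churn'],
    [45, 275000, 'Employed', 'No Debt', 'Churn'],
    [19, 786000, 'Employed', 'Debt', 'Churn'],
]

# map each record to its (employment, debt, churn) triple, then tally the triples
combination_counts = Counter(map(lambda x: (x[2], x[3], x[4]), sample_database))

for (employment, debt, churn), count in sorted(combination_counts.items()):
    print(f"{employment}, {debt}, {churn}: ", count)
```

Looking up a combination that never occurs, such as `combination_counts[('Unemployed', 'No Debt', 'Churn')]` in this sample, returns 0 rather than raising an error.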
