<a href="https://colab.research.google.com/github/brunofbpaula/DataScience-UM-Coursera/blob/main/Numpy/Datasets_Numpy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Datasets in NumPy
To load a dataset in NumPy, we can use the genfromtxt() function. We pass the data file name, a delimiter (optional, but usually needed for CSV files), and the number of header rows to skip. Our files have a single header row, so we skip one.

The genfromtxt() function also has an optional parameter called dtype for specifying the data type of each column. By default dtype is float, so every column is parsed as a floating-point number, and values that cannot be converted become nan.
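
As a small illustration of the dtype behavior described above, the sketch below parses a tiny in-memory CSV instead of a file. The data, the `csv_text` variable, and the column contents are made up for the example; only `np.genfromtxt` and its `delimiter`, `skip_header`, `dtype`, and `encoding` parameters come from NumPy.

```python
import io
import numpy as np

# Hypothetical inline data standing in for a CSV file on disk
csv_text = "a;b;label\n1;2.5;x\n3;4.5;y\n"

# Default dtype (float): every field is parsed as a float,
# and text that cannot be converted becomes nan
as_float = np.genfromtxt(io.StringIO(csv_text), delimiter=";", skip_header=1)
print(as_float.dtype)
print(as_float)

# dtype=None asks NumPy to infer a type per column,
# which yields a one-dimensional structured array
inferred = np.genfromtxt(io.StringIO(csv_text), delimiter=";", skip_header=1,
                         dtype=None, encoding="utf-8")
print(inferred.dtype.names)
print(inferred["f2"])
```

With the default dtype the text column is lost as nan, which is why dtype=None (or an explicit dtype) is useful for files that mix numbers and strings.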

In [1]:
import numpy as np
import math

## Wine Quality
Here we have a very popular dataset on wine quality, and we are going to look at red wines. The data fields are: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol, and quality.

In [2]:
wines = np.genfromtxt("dataset/winequality-red.csv", delimiter=";", skip_header=1)
wines

array([[ 7.4  ,  0.7  ,  0.   , ...,  0.56 ,  9.4  ,  5.   ],
       [ 7.8  ,  0.88 ,  0.   , ...,  0.68 ,  9.8  ,  5.   ],
       [ 7.8  ,  0.76 ,  0.04 , ...,  0.65 ,  9.8  ,  5.   ],
       ...,
       [ 6.3  ,  0.51 ,  0.13 , ...,  0.75 , 11.   ,  6.   ],
       [ 5.9  ,  0.645,  0.12 , ...,  0.71 , 10.2  ,  5.   ],
       [ 6.   ,  0.31 ,  0.47 , ...,  0.66 , 11.   ,  6.   ]])

In [3]:
# Getting the fixed acidity column (first column)
# For multidimensional arrays, the first index refers to the row and the second one to the column.
# An integer column index returns a one-dimensional array, while a slice keeps the result two-dimensional.

print("One integer '0' for slicing: ", wines[:, 0])  # 1-D array with one value per row
print('0 to 1 for slicing: \n', wines[:, 0:1])  # Same values as a 2-D array with a single column


One integer '0' for slicing:  [7.4 7.8 7.8 ... 6.3 5.9 6. ]
0 to 1 for slicing: 
 [[7.4]
 [7.8]
 [7.8]
 ...
 [6.3]
 [5.9]
 [6. ]]
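
The difference between the two results above is easiest to see in their shapes. The sketch below uses a small stand-in array (`data`, with made-up values) rather than the wine data.

```python
import numpy as np

# Small stand-in array with hypothetical values: 3 rows by 4 columns
data = np.arange(12, dtype=float).reshape(3, 4)

col_1d = data[:, 0]    # integer index drops the column axis
col_2d = data[:, 0:1]  # slice keeps the column axis

print(col_1d.shape)  # (3,)
print(col_2d.shape)  # (3, 1)
```

Both hold the same values; only the number of dimensions differs.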


In [4]:
# Non-consecutive columns
# We simply pass a list of the column indices as the second index
wines[:, [0,2,4]]

array([[7.4  , 0.   , 0.076],
       [7.8  , 0.   , 0.098],
       [7.8  , 0.04 , 0.092],
       ...,
       [6.3  , 0.13 , 0.076],
       [5.9  , 0.12 , 0.075],
       [6.   , 0.47 , 0.067]])

In [5]:
# Finding the average quality of red wine
# The last column

print(f"The average quality of red wine is {wines[:, -1].mean():.2f}.")

The average quality of red wine is 5.64.


In [6]:
# Getting the rows where the alcohol percentage is higher than thirteen
high_alcohol = wines[wines[:,-2] > 13]

# Now getting the average quality of red wine with high alcohol content
print(f"The average quality of high-alcohol-content red wine is {high_alcohol[:, -1].mean():.2f}.")

The average quality of high-alcohol-content red wine is 6.52.
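
The filtering step above relies on boolean masking. A minimal sketch on a made-up array (`sample`, with hypothetical alcohol and quality values) shows the intermediate mask explicitly:

```python
import numpy as np

# Hypothetical rows: [alcohol, quality]
sample = np.array([[9.4, 5.0],
                   [13.5, 7.0],
                   [14.0, 6.0],
                   [10.2, 5.0]])

mask = sample[:, 0] > 13  # boolean array with one entry per row
selected = sample[mask]   # keeps only the rows where mask is True

print(mask)                   # [False  True  True False]
print(selected[:, 1].mean())  # 6.5
```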


## Admission Predict

This dataset contains fields such as GRE score, TOEFL score, university rating, GPA, having research or not, and a chance of admission.

In [7]:
# It's possible to specify the data field names when using genfromtxt() to load CSV data.
# We can also have NumPy try to infer the type of each column by setting the dtype parameter to None.

graduate_admission = np.genfromtxt('dataset/Admission_Predict.csv', dtype=None, delimiter=",", skip_header=1,
                                   names=('Serial No', 'GRE Score', 'TOEFL Score', 'University Rating', 'SOP', 'LOR',
                                          'CGPA', 'Research', 'Chance of Admit'))

# The result is a one-dimensional structured array with 400 records
graduate_admission.shape

(400,)

In [8]:
# Now that we've given all columns a name, it's easier to retrieve a column.
# We just need to use the column's name
# Note that genfromtxt() replaces spaces in field names with underscores
graduate_admission['TOEFL_Score'][0:10]

array([118, 107, 104, 110, 103, 115, 109, 101, 102, 108])
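
To check how the field names end up after loading, we can inspect `dtype.names`. The sketch below uses a tiny in-memory CSV with made-up rows (`csv_text` and `records` are names chosen for the example), not the admission file itself.

```python
import io
import numpy as np

# Hypothetical two-column excerpt standing in for the CSV file
csv_text = "Serial No,GRE Score\n1,337\n2,324\n"

records = np.genfromtxt(io.StringIO(csv_text), delimiter=",", skip_header=1,
                        dtype=None, names=("Serial No", "GRE Score"))

print(records.dtype.names)   # ('Serial_No', 'GRE_Score')
print(records["GRE_Score"])  # [337 324]
```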

In [9]:
# Let's find out how many students have done research
print("The number of students that have done research is {}.".format(len(graduate_admission[graduate_admission['Research']==1])))

The number of students that have done research is 219.


In [10]:
# Do students with a high chance of admission (> 0.8) have, on average, a higher GRE score than
# those with a low chance of admission (< 0.4)? Let's see.

print(graduate_admission[graduate_admission['Chance_of_Admit'] > 0.8]['GRE_Score'].mean())  # Students with high chance of admission
graduate_admission[graduate_admission['Chance_of_Admit'] < 0.4]['GRE_Score'].mean()  # Students with low chance of admission

328.7350427350427


302.2857142857143
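
The same comparison can be written with a small helper so the two groups are computed the same way. This is only a sketch: `students` holds made-up records, and `mean_gre` is a hypothetical helper, not part of NumPy or the original notebook.

```python
import numpy as np

# Hypothetical records using the same field names as above
students = np.array([(330, 0.90), (300, 0.35), (325, 0.85), (305, 0.30)],
                    dtype=[("GRE_Score", "i8"), ("Chance_of_Admit", "f8")])

def mean_gre(records, mask):
    """Average GRE score of the records selected by a boolean mask."""
    return records[mask]["GRE_Score"].mean()

high = mean_gre(students, students["Chance_of_Admit"] > 0.8)
low = mean_gre(students, students["Chance_of_Admit"] < 0.4)

print(high, low)  # 327.5 302.5
```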