# NumPy


NumPy is a popular Python library for scientific computing and numerical operations. It provides powerful tools for working with multi-dimensional arrays and matrices, which are essential for many data science and machine learning applications. NumPy also offers a wide range of mathematical functions and algorithms that can efficiently process large amounts of data. With NumPy, users can easily perform operations such as matrix multiplication, element-wise calculations, and statistical analysis, all while taking advantage of its fast and optimized code.

NumPy is widely used in the data science community and is a foundational library for many other Python data analysis tools. Its ability to work with large datasets efficiently makes it a valuable tool for both data exploration and model building. NumPy also integrates well with other popular Python libraries such as Pandas, Matplotlib, and SciPy, making it an essential part of any data scientist's toolkit. Whether you're a beginner or an advanced user, NumPy can help you perform complex calculations with ease and speed up your data analysis workflow.



NumPy documentation: https://numpy.org/doc/stable/


NumPy is the core library for scientific computing in Python. Foundational Python libraries such as Pandas, SciPy, and Matplotlib are built on top of NumPy.

In [1]:
import numpy as np

### Arrays vs. Python Lists
- Python lists can hold elements of different data types, whereas all the elements in a NumPy array must share the same data type. This makes NumPy very efficient: it does not need to check the data type of each element in an array, since they are all the same. Having a single data type also means that a NumPy array takes up less memory than the same information stored as a Python list.
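Both points can be checked with a short sketch. It assumes a 64-bit CPython build, the variable names are illustrative, and the exact byte count for the list varies between Python versions, so only the comparison matters.

```python
import sys
import numpy as np

# A list can mix types, but NumPy converts every element to one common dtype.
mixed = [1, 2.5, 3]
arr = np.array(mixed)
print(arr.dtype)  # float64: the integers were upcast to floats

# Compare the memory used by 1,000 integers in each container.
values = list(range(1000))
arr_values = np.array(values, dtype=np.int64)

# A list stores references to separate int objects, so count both parts.
list_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)
array_bytes = arr_values.nbytes  # 1,000 elements * 8 bytes each

print(array_bytes, list_bytes)
```

The list total is only an approximation of its real footprint, but it is enough to show that the array stores the same numbers in considerably less space.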

In [2]:
num1 = [4, 5, 6, 7]
num2 = [7, 8, 9, 10]

In [3]:
sum_list = num1 + num2
sum_list

[4, 5, 6, 7, 7, 8, 9, 10]

In [4]:
a1 = np.array(num1)
a2 = np.array(num2)

a1 + a2

array([11, 13, 15, 17])

In [6]:
combined = np.array([num1, num2])
combined

array([[ 4,  5,  6,  7],
       [ 7,  8,  9, 10]])

In [7]:
combined.shape

(2, 4)

In [8]:
combined.ndim

2

In [9]:
combined.mean()

7.0

In [10]:
np.mean(combined)

7.0

In [11]:
np.mean(combined, axis=0)

array([5.5, 6.5, 7.5, 8.5])

In [12]:
np.mean(combined, axis=1)

array([5.5, 8.5])

### Class Task
- Create an array called monthly_sales to store the monthly sales of 3 different products over a 12-month period.
- Create a 2D array called total_sales which contains the total sales for each month.
- Concatenate total_sales with monthly_sales into a new array called monthly_sales_with_total.
- Create a 1D array called avg_monthly_sales which contains the average sales amount for each month.


In [14]:
sales = np.array([[ 4134, 23925,  8657],
                  [ 4116, 23875,  9142],
                  [ 4673, 27197, 10645],
                  [ 4580, 25637, 10456],
                  [ 5109, 27995, 11299],
                  [ 5011, 27419, 10625],
                  [ 5245, 27305, 10630],
                  [ 5270, 27760, 11550],
                  [ 4680, 24988,  9762],
                  [ 4913, 25802, 10456],
                  [ 5312, 25405, 13401],
                  [ 6630, 27797, 18403]])

sales

array([[ 4134, 23925,  8657],
       [ 4116, 23875,  9142],
       [ 4673, 27197, 10645],
       [ 4580, 25637, 10456],
       [ 5109, 27995, 11299],
       [ 5011, 27419, 10625],
       [ 5245, 27305, 10630],
       [ 5270, 27760, 11550],
       [ 4680, 24988,  9762],
       [ 4913, 25802, 10456],
       [ 5312, 25405, 13401],
       [ 6630, 27797, 18403]])

In [15]:
sales.shape

(12, 3)

In [16]:
sales.ndim

2

In [17]:
sales.size

36

In [24]:
# Average sales per month (mean across the three products in each row)
a = np.mean(sales, axis=1)
a

array([12238.66666667, 12377.66666667, 14171.66666667, 13557.66666667,
       14801.        , 14351.66666667, 14393.33333333, 14860.        ,
       13143.33333333, 13723.66666667, 14706.        , 17610.        ])

In [27]:
a.reshape(12,1)

array([[12238.66666667],
       [12377.66666667],
       [14171.66666667],
       [13557.66666667],
       [14801.        ],
       [14351.66666667],
       [14393.33333333],
       [14860.        ],
       [13143.33333333],
       [13723.66666667],
       [14706.        ],
       [17610.        ]])

In [21]:
np.sum(sales, axis=1)

array([36716, 37133, 42515, 40673, 44403, 43055, 43180, 44580, 39430,
       41171, 44118, 52830])
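One possible way to finish the class task is sketched below. The names monthly_sales, total_sales, monthly_sales_with_total and avg_monthly_sales come from the task description; the array repeats the sales figures so that the block runs on its own.

```python
import numpy as np

# Rows are months (Jan to Dec), columns are the 3 products.
monthly_sales = np.array([
    [4134, 23925, 8657],
    [4116, 23875, 9142],
    [4673, 27197, 10645],
    [4580, 25637, 10456],
    [5109, 27995, 11299],
    [5011, 27419, 10625],
    [5245, 27305, 10630],
    [5270, 27760, 11550],
    [4680, 24988, 9762],
    [4913, 25802, 10456],
    [5312, 25405, 13401],
    [6630, 27797, 18403],
])

# Total per month: sum across the product columns, kept as a (12, 1) column.
total_sales = monthly_sales.sum(axis=1, keepdims=True)

# Append the totals as a fourth column next to the three products.
monthly_sales_with_total = np.concatenate([monthly_sales, total_sales], axis=1)

# Average per month as a flat 1D array of length 12.
avg_monthly_sales = monthly_sales.mean(axis=1)

print(monthly_sales_with_total.shape)  # (12, 4)
print(avg_monthly_sales.shape)         # (12,)
```

Passing `keepdims=True` gives the same result as calling `.reshape(12, 1)` on the 1D sum; either way, the totals need to be 2D before `np.concatenate` can join them to the sales array along `axis=1`.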

In [22]:
sales.reshape

<function ndarray.reshape>

### Creating arrays using built-in functions
- np.zeros()
- np.ones()
- np.arange()
- np.random: rand(), randn(), randint()
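A brief sketch of each function follows. The shapes and ranges are arbitrary choices for illustration, and the random values differ on every run.

```python
import numpy as np

zeros = np.zeros((2, 3))     # 2x3 array filled with 0.0
ones = np.ones(4)            # 1D array of four 1.0 values
evens = np.arange(0, 10, 2)  # start, stop (exclusive), step
print(evens)                 # [0 2 4 6 8]

uniform = np.random.rand(2, 3)           # uniform floats in [0, 1)
normal = np.random.randn(5)              # samples from the standard normal distribution
dice = np.random.randint(1, 7, size=10)  # integers from 1 to 6 (upper bound is exclusive)

print(zeros.shape, ones.shape, uniform.shape, normal.shape, dice.shape)
```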

## Pandas

Pandas is a powerful and widely used Python library for data manipulation and analysis. It provides tools for working with structured data, such as tables, spreadsheets, and time series. Pandas allows users to read in data from various file formats, such as CSV or Excel files, and manipulate it in many ways, including filtering, sorting, grouping, and aggregating. Additionally, Pandas provides tools for data cleaning, handling missing data, and data visualization.

Pandas is built on top of NumPy and is dependent on it. NumPy provides the underlying data structure for Pandas to work with, specifically, the ndarray (N-dimensional array), which is a powerful data structure for performing fast and efficient numerical computations.
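The relationship can be seen in a small sketch. The two-column table below reuses the first three months of the sales figures purely for illustration, and the behaviour shown applies to numeric columns like these.

```python
import numpy as np
import pandas as pd

# Small illustrative table: first three months of two products.
df = pd.DataFrame({'Product1': [4134, 4116, 4673],
                   'Product2': [23925, 23875, 27197]})

values = df.to_numpy()  # the data as a plain NumPy array
print(type(values))     # <class 'numpy.ndarray'>
print(values.shape)     # (3, 2)

# NumPy functions accept pandas objects directly.
print(np.sum(df['Product1']))  # 12923
```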

Pandas is a must-have tool for any data scientist or analyst working with data in Python. It provides a user-friendly interface for working with complex data structures and makes data analysis more accessible and efficient. With its powerful data manipulation and transformation capabilities, Pandas has become a popular tool in many industries, including finance, healthcare, and retail. Whether you're analyzing large datasets or working with smaller, more structured data, Pandas provides a versatile set of tools to help you manipulate and analyze your data with ease.

Pandas documentation: https://pandas.pydata.org/docs/

In [28]:
import pandas as pd

In [29]:
# Convert array into DataFrame
pd.DataFrame(data=sales, columns=['Product1', 'Product2', 'Product3'], index=['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sept', 'Oct', 'Nov', 'Dec'])

,Product1,Product2,Product3
Jan,4134,23925,8657
Feb,4116,23875,9142
Mar,4673,27197,10645
Apr,4580,25637,10456
May,5109,27995,11299
Jun,5011,27419,10625
Jul,5245,27305,10630
Aug,5270,27760,11550
Sept,4680,24988,9762
Oct,4913,25802,10456
Nov,5312,25405,13401
Dec,6630,27797,18403


#### Class Task:

Using the two tables defined below (taken from a school's student records), answer the following questions:
- Who is the oldest in the class?
- Who was admitted last?
- How many male students are in the record?
- Which students are over 20, and how old are they?
- What is the average age of the female students?
- What is the full name of the oldest student?

In [30]:
data1 = {
    'Id': [202, 212, 311, 312],
    'Admission year': ['01/01/2000', '01/02/2010', '01/03/2017', '01/04/2022'],
    'Gender': ['M', 'F', 'M', 'F'],
    'Name': ['Bob', 'Sandy', 'John', 'Jill'],
    'Age': [20, 42, 18, 25]
}
data2 = {
    'Id': [202, 212, 311, 312, 502],
    'Full Name': ['Bob Mike', 'Sandy Kane', 'John Bull', 'Jill Ana', 'Joy Eleve'],
    'House': ['Blue', 'Green', 'Yellow', 'Brown', 'indigo']
}

In [32]:
today = {
    'Names': ['Abiola', 'Chidozie', 'Ololade', 'Nancy'],
    'Time': ['5pm', '6pm', '4pm', '7pm']
}

In [33]:
today['Names']

['Abiola', 'Chidozie', 'Ololade', 'Nancy']

In [34]:
today['Time']

['5pm', '6pm', '4pm', '7pm']

In [36]:
pd.DataFrame(today)

,Names,Time
0,Abiola,5pm
1,Chidozie,6pm
2,Ololade,4pm
3,Nancy,7pm


In [37]:
df1 = pd.DataFrame(data1)
df1

,Id,Admission year,Gender,Name,Age
0,202,01/01/2000,M,Bob,20
1,212,01/02/2010,F,Sandy,42
2,311,01/03/2017,M,John,18
3,312,01/04/2022,F,Jill,25


In [38]:
df2 = pd.DataFrame(data2)
df2

,Id,Full Name,House
0,202,Bob Mike,Blue
1,212,Sandy Kane,Green
2,311,John Bull,Yellow
3,312,Jill Ana,Brown
4,502,Joy Eleve,indigo


In [43]:
# The oldest in class
old = df1['Age'].max()
old

42

In [44]:
df1[df1['Age'] == old]

,Id,Admission year,Gender,Name,Age
1,212,01/02/2010,F,Sandy,42


In [50]:
# Reuse the maximum age computed above instead of hard-coding 42
df1.loc[df1['Age'] == old, ['Name', 'Age']]

,Name,Age
1,Sandy,42


In [55]:
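# Who was admitted last
# Parse the strings as dates first; max() on raw strings compares text, not time.
# Assumes the dates are written day/month/year.
admitted = pd.to_datetime(df1['Admission year'], format='%d/%m/%Y')
df1[admitted == admitted.max()]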

,Id,Admission year,Gender,Name,Age
3,312,01/04/2022,F,Jill,25


In [None]:
# Average age of female student
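
The remaining questions from the class task can be answered along the following lines. This is one possible approach rather than the only one: it rebuilds df1 and df2 from the dictionaries above so that the block runs on its own, and the result names (male_count, over_20, avg_female_age, merged, oldest_full_name) are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({
    'Id': [202, 212, 311, 312],
    'Admission year': ['01/01/2000', '01/02/2010', '01/03/2017', '01/04/2022'],
    'Gender': ['M', 'F', 'M', 'F'],
    'Name': ['Bob', 'Sandy', 'John', 'Jill'],
    'Age': [20, 42, 18, 25],
})
df2 = pd.DataFrame({
    'Id': [202, 212, 311, 312, 502],
    'Full Name': ['Bob Mike', 'Sandy Kane', 'John Bull', 'Jill Ana', 'Joy Eleve'],
    'House': ['Blue', 'Green', 'Yellow', 'Brown', 'indigo'],
})

# How many male students are in the record?
male_count = (df1['Gender'] == 'M').sum()

# Students older than 20 (Bob, aged exactly 20, is excluded).
over_20 = df1.loc[df1['Age'] > 20, ['Name', 'Age']]

# Average age of the female students.
avg_female_age = df1.loc[df1['Gender'] == 'F', 'Age'].mean()

# Full name of the oldest student: join the two tables on Id first.
merged = df1.merge(df2, on='Id', how='left')
oldest_full_name = merged.loc[merged['Age'].idxmax(), 'Full Name']

print(male_count)        # 2
print(over_20)
print(avg_female_age)    # 33.5
print(oldest_full_name)  # Sandy Kane
```

A left merge keeps every student in df1 and ignores Id 502, which appears only in df2.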


### Inspecting a DataFrame

When you get a new DataFrame to work with, the first thing you need to do is explore it and see what it contains. There are several useful methods and attributes for this.

- head() - returns the first few rows (the “head” of the DataFrame).
- info() - shows information on each of the columns, such as the data type and number of missing values.
- shape - returns the number of rows and columns of the DataFrame.
- describe() - calculates a few summary statistics for each column.
- columns - an Index holding the column names.


In [54]:
df = pd.read_csv('Health_Data.csv')
df.head()

,General_Health,Checkup,Exercise,Heart_Disease,Skin_Cancer,Other_Cancer,Depression,Diabetes,Arthritis,Sex,Age_Category,Height_(cm),Weight_(kg),BMI,Smoking_History,Alcohol_Consumption,Fruit_Consumption,Green_Vegetables_Consumption,FriedPotato_Consumption
0,Poor,Within the past 2 years,No,No,No,No,No,No,Yes,Female,70-74,150,32.66,14.54,Yes,0,30,16,12
1,Very Good,Within the past year,No,Yes,No,No,No,Yes,No,Female,70-74,165,77.11,28.29,No,0,30,0,4
2,Very Good,Within the past year,Yes,No,No,No,No,Yes,No,Female,60-64,163,88.45,33.47,No,4,12,3,16
3,Poor,Within the past year,Yes,Yes,No,No,No,Yes,No,Male,75-79,180,93.44,28.73,No,0,30,30,8
4,Good,Within the past year,No,No,No,No,No,No,No,Male,80+,191,88.45,24.37,Yes,0,8,4,0


In [59]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 308854 entries, 0 to 308853
Data columns (total 19 columns):
 #   Column                        Non-Null Count   Dtype  
---  ------                        --------------   -----  
 0   General_Health                308854 non-null  object 
 1   Checkup                       308854 non-null  object 
 2   Exercise                      308854 non-null  object 
 3   Heart_Disease                 308854 non-null  object 
 4   Skin_Cancer                   308854 non-null  object 
 5   Other_Cancer                  308854 non-null  object 
 6   Depression                    308854 non-null  object 
 7   Diabetes                      308854 non-null  object 
 8   Arthritis                     308854 non-null  object 
 9   Sex                           308854 non-null  object 
 10  Age_Category                  308854 non-null  object 
 11  Height_(cm)                   308854 non-null  int64  
 12  Weight_(kg)                   308854 non-nul

In [57]:
df.shape

(308854, 19)

In [65]:
# Numerical variables only
df.describe()

,Height_(cm),Weight_(kg),BMI,Alcohol_Consumption,Fruit_Consumption,Green_Vegetables_Consumption,FriedPotato_Consumption
count,308854.0,308854.0,308854.0,308854.0,308854.0,308854.0,308854.0
mean,170.615249,83.588655,28.626211,5.096366,29.8352,15.110441,6.296616
std,10.658026,21.34321,6.522323,8.199763,24.875735,14.926238,8.582954
min,91.0,24.95,12.02,0.0,0.0,0.0,0.0
25%,163.0,68.04,24.21,0.0,12.0,4.0,2.0
50%,170.0,81.65,27.44,1.0,30.0,12.0,4.0
75%,178.0,95.25,31.85,6.0,30.0,20.0,8.0
max,241.0,293.02,99.33,30.0,120.0,128.0,128.0


In [64]:
# Categorical (object) columns only
df.describe(include='object')

,General_Health,Checkup,Exercise,Heart_Disease,Skin_Cancer,Other_Cancer,Depression,Diabetes,Arthritis,Sex,Age_Category,Smoking_History
count,308854,308854,308854,308854,308854,308854,308854,308854,308854,308854,308854,308854
unique,5,5,2,2,2,2,2,4,2,2,13,2
top,Very Good,Within the past year,Yes,No,No,No,No,No,No,Female,65-69,No
freq,110395,239371,239381,283883,278860,278976,246953,259141,207783,160196,33434,183590


In [61]:
df.columns

Index(['General_Health', 'Checkup', 'Exercise', 'Heart_Disease', 'Skin_Cancer',
       'Other_Cancer', 'Depression', 'Diabetes', 'Arthritis', 'Sex',
       'Age_Category', 'Height_(cm)', 'Weight_(kg)', 'BMI', 'Smoking_History',
       'Alcohol_Consumption', 'Fruit_Consumption',
       'Green_Vegetables_Consumption', 'FriedPotato_Consumption'],
      dtype='object')