# <en><center>Dataset Basics</center></en>

![Flare.png](attachment:697b1f43-0bfb-49f1-99e8-3112cf7ae3b8.png)

## Table of Contents

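- [Loading Datasets](#Loading-Datasets)
- [Numpy vs Pandas](#Numpy-vs-Pandas)
- [Creating DataFrames](#Creating-DataFrames)
- [Saving and Serializing a DataFrame](#Saving-and-Serializing-a-DataFrame)
- [Inspecting Data](#Inspecting-Data)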

## Library

In [1]:
import numpy as np
import pandas as pd
import pickle

## Loading Datasets

### Dataset

In [2]:
heart_file = "C:\\Users\\pyria\\OneDrive\\Documents\\Personal Development\\Python\\Bootcamps\\\
Pandas Bootcamp\\P87-Section-2-Dataset-Basics-Resources\\heart.csv"

#### The best method - pandas read_csv

In [4]:
df = pd.read_csv(heart_file)
df.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


### Numpy loadtxt and genfromtxt

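- `np.loadtxt` fails on this file unless you pass extra arguments (here `delimiter` and `skiprows`).
- It is less forgiving than `pd.read_csv`: you have to tell it how the file is laid out.
- It is designed for loading data that was saved with `np.savetxt`.
- It was never meant to be a robust, general-purpose data loader.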
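
As a minimal sketch of the first two points, the cell below feeds a tiny in-memory CSV to `np.loadtxt`, first with default arguments and then with `delimiter` and `skiprows`. The text in `csv_text` is made up for illustration and is not the heart dataset.

```python
import io

import numpy as np

# A made-up two-column CSV with a header row, standing in for a real file.
csv_text = "age,chol\n63,233\n37,250\n"

# With default arguments loadtxt splits on whitespace and tries to
# convert the header line to numbers, which raises a ValueError.
try:
    np.loadtxt(io.StringIO(csv_text))
    failed = False
except ValueError:
    failed = True

# Naming the delimiter and skipping the header row lets the same text load.
data = np.loadtxt(io.StringIO(csv_text), delimiter=",", skiprows=1)
print(failed, data.shape, data.dtype)
```

Note that the column names are lost on the way: `skiprows=1` simply discards the header line.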

In [6]:
data = np.loadtxt(heart_file, delimiter = ",", skiprows = 1)
print(data)

[[63.  1.  3. ...  0.  1.  1.]
 [37.  1.  2. ...  0.  2.  1.]
 [41.  0.  1. ...  0.  2.  1.]
 ...
 [68.  1.  0. ...  2.  3.  0.]
 [57.  1.  0. ...  1.  3.  0.]
 [57.  0.  1. ...  1.  2.  0.]]


In [5]:
data = np.genfromtxt(heart_file, delimiter = ",", dtype = None, names = True, encoding = "utf-8-sig")
print(data)
print(data.dtype)

[(63, 1, 3, 145, 233, 1, 0, 150, 0, 2.3, 0, 0, 1, 1)
 (37, 1, 2, 130, 250, 0, 1, 187, 0, 3.5, 0, 0, 2, 1)
 (41, 0, 1, 130, 204, 0, 0, 172, 0, 1.4, 2, 0, 2, 1)
 (56, 1, 1, 120, 236, 0, 1, 178, 0, 0.8, 2, 0, 2, 1)
 (57, 0, 0, 120, 354, 0, 1, 163, 1, 0.6, 2, 0, 2, 1)
 (57, 1, 0, 140, 192, 0, 1, 148, 0, 0.4, 1, 0, 1, 1)
 (56, 0, 1, 140, 294, 0, 0, 153, 0, 1.3, 1, 0, 2, 1)
 (44, 1, 1, 120, 263, 0, 1, 173, 0, 0. , 2, 0, 3, 1)
 (52, 1, 2, 172, 199, 1, 1, 162, 0, 0.5, 2, 0, 3, 1)
 (57, 1, 2, 150, 168, 0, 1, 174, 0, 1.6, 2, 0, 2, 1)
 (54, 1, 0, 140, 239, 0, 1, 160, 0, 1.2, 2, 0, 2, 1)
 (48, 0, 2, 130, 275, 0, 1, 139, 0, 0.2, 2, 0, 2, 1)
 (49, 1, 1, 130, 266, 0, 1, 171, 0, 0.6, 2, 0, 2, 1)
 (64, 1, 3, 110, 211, 0, 0, 144, 1, 1.8, 1, 0, 2, 1)
 (58, 0, 3, 150, 283, 1, 0, 162, 0, 1. , 2, 0, 2, 1)
 (50, 0, 2, 120, 219, 0, 1, 158, 0, 1.6, 1, 0, 2, 1)
 (58, 0, 2, 120, 340, 0, 1, 172, 0, 0. , 2, 0, 2, 1)
 (66, 0, 3, 150, 226, 0, 1, 114, 0, 2.6, 0, 0, 2, 1)
 (43, 1, 0, 150, 247, 0, 1, 171, 0, 1.5, 2, 0,

### Manual Loading
- A last resort for weird file structures that the library loaders cannot parse.

In [8]:
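def load_file(filename):
    """Parse a purely numeric CSV with a header row into a DataFrame."""
    # utf-8-sig strips a leading byte-order mark if the file has one
    with open(filename, encoding = "utf-8-sig") as f:
        data, cols = [], []
        for i, line in enumerate(f.read().splitlines()):
            if i == 0:
                cols += line.split(",")  # first line holds the column names
            else:
                data.append([float(x) for x in line.split(",")])  # every value becomes a float
        df = pd.DataFrame(data, columns = cols)
    return df
load_file(heart_file).head()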

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63.0,1.0,3.0,145.0,233.0,1.0,0.0,150.0,0.0,2.3,0.0,0.0,1.0,1.0
1,37.0,1.0,2.0,130.0,250.0,0.0,1.0,187.0,0.0,3.5,0.0,0.0,2.0,1.0
2,41.0,0.0,1.0,130.0,204.0,0.0,0.0,172.0,0.0,1.4,2.0,0.0,2.0,1.0
3,56.0,1.0,1.0,120.0,236.0,0.0,1.0,178.0,0.0,0.8,2.0,0.0,2.0,1.0
4,57.0,0.0,0.0,120.0,354.0,0.0,1.0,163.0,1.0,0.6,2.0,0.0,2.0,1.0


### Pickles!

In [9]:
pickle_file = "C:\\Users\\pyria\\OneDrive\\Documents\\Personal Development\\Python\\Bootcamps\\Pandas Bootcamp\\\
P87-Section-2-Dataset-Basics-Resources\\heart.pkl"

df = pd.read_pickle(pickle_file)
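df.head  # no parentheses: this displays the bound method and the frame's repr; call df.head() for the first five rows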

<bound method NDFrame.head of      age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  \
0     63    1   3       145   233    1        0      150      0      2.3   
1     37    1   2       130   250    0        1      187      0      3.5   
2     41    0   1       130   204    0        0      172      0      1.4   
3     56    1   1       120   236    0        1      178      0      0.8   
4     57    0   0       120   354    0        1      163      1      0.6   
..   ...  ...  ..       ...   ...  ...      ...      ...    ...      ...   
298   57    0   0       140   241    0        1      123      1      0.2   
299   45    1   3       110   264    0        1      132      0      1.2   
300   68    1   0       144   193    1        1      141      0      3.4   
301   57    1   0       130   131    0        1      115      1      1.2   
302   57    0   1       130   236    0        0      174      0      0.0   

     slope  ca  thal  target  
0        0   0     1      

### Loading Data Summary

- Use `pd.read_csv` 99% of the time.
- Use the other `pd.read_*` functions for other formats (`pd.read_excel`, `pd.read_pickle`, etc.).
- If pandas can't handle it, NumPy probably can't either.
- If you have to use a manual function, save the parsed data to a sensible format afterwards.
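
The last point can be sketched as follows, assuming a made-up `key=value` text layout that stands in for a file no library loader reads directly. The names `raw_text` and `clean.csv` are hypothetical, and the temporary directory only keeps the example self-contained.

```python
import os
import tempfile

import pandas as pd

# Hypothetical raw export: one record per line, written as key=value pairs.
raw_text = "age=63 chol=233\nage=37 chol=250\n"

# Manual parsing: turn each line into a dict, then build the DataFrame.
records = [dict(pair.split("=") for pair in line.split())
           for line in raw_text.splitlines()]
parsed = pd.DataFrame(records).astype("int64")

with tempfile.TemporaryDirectory() as tmp_dir:
    clean_path = os.path.join(tmp_dir, "clean.csv")
    parsed.to_csv(clean_path, index=False)  # save once in a standard format
    reloaded = pd.read_csv(clean_path)      # later loads need no custom code

print(reloaded)
```

After the one-off conversion, every later session can go back to plain `pd.read_csv`.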

## Numpy vs Pandas

1. Pandas is built on a NumPy core.
2. Pandas adds extra structure and tools (labels, an index, mixed column types), but sometimes you have to strip that away and work with the raw array.
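
A minimal sketch of what "stripping away" the structure means, using a small made-up frame rather than the heart data: a NumPy array has a single dtype, so mixed integer and float columns come out as `float64`, and an explicit copy is the safe way to get an array you can modify freely.

```python
import numpy as np
import pandas as pd

# Made-up frame with one integer column and one float column.
frame = pd.DataFrame({"count": [1, 2, 3], "ratio": [0.5, 1.5, 2.5]})
print(frame.dtypes)

# The array can hold only one dtype, so the integers are upcast to float64
# and the column labels are gone.
values = frame.to_numpy()
print(values.dtype)

# Asking for a copy guarantees that writing to the array leaves the frame alone.
values_copy = frame.to_numpy(copy=True)
values_copy[0, 0] = 100
print(frame.loc[0, "count"])
```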

In [10]:
df = pd.read_csv(heart_file)

In [11]:
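data = df.to_numpy()  # recommended way to get the data as a NumPy array
data = df.values      # older attribute that gives the same result here
data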

array([[63.,  1.,  3., ...,  0.,  1.,  1.],
       [37.,  1.,  2., ...,  0.,  2.,  1.],
       [41.,  0.,  1., ...,  0.,  2.,  1.],
       ...,
       [68.,  1.,  0., ...,  2.,  3.,  0.],
       [57.,  1.,  0., ...,  1.,  3.,  0.],
       [57.,  0.,  1., ...,  1.,  2.,  0.]])

In [12]:
df.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [13]:
print(data.dtype, data)

float64 [[63.  1.  3. ...  0.  1.  1.]
 [37.  1.  2. ...  0.  2.  1.]
 [41.  0.  1. ...  0.  2.  1.]
 ...
 [68.  1.  0. ...  2.  3.  0.]
 [57.  1.  0. ...  1.  3.  0.]
 [57.  0.  1. ...  1.  2.  0.]]


In [15]:
data[0,0] = 100  # df stays unchanged: building one float64 array from int and float columns made a copy
df.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [17]:
df2 = df[["age", "sex", "cp"]]
data2 = df2.to_numpy().copy() # copy the array so that writing to data2 cannot change df2
data2[0,0]= 100
df2

Unnamed: 0,age,sex,cp
0,63,1,3
1,37,1,2
2,41,0,1
3,56,1,1
4,57,0,0
...,...,...,...
298,57,0,0
299,45,1,3
300,68,1,0
301,57,1,0


In [18]:
df.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [20]:
print(df["age"].quantile(0.5))

55.0


In [21]:
print(df["age"].to_numpy().reshape((3,-1)))

[[63 37 41 56 57 57 56 44 52 57 54 48 49 64 58 50 58 66 43 69 59 44 42 61
  40 71 59 51 65 53 41 65 44 54 51 46 54 54 65 65 51 48 45 53 39 52 44 47
  53 53 51 66 62 44 63 52 48 45 34 57 71 54 52 41 58 35 51 45 44 62 54 51
  29 51 43 55 51 59 52 58 41 45 60 52 42 67 68 46 54 58 48 57 52 54 45 53
  62 52 43 53 42]
 [59 63 42 50 68 69 45 50 50 64 57 64 43 55 37 41 56 46 46 64 59 41 54 39
  34 47 67 52 74 54 49 42 41 41 49 60 62 57 64 51 43 42 67 76 70 44 60 44
  42 66 71 64 66 39 58 47 35 58 56 56 55 41 38 38 67 67 62 63 53 56 48 58
  58 60 40 60 64 43 57 55 65 61 58 50 44 60 54 50 41 51 58 54 60 60 59 46
  67 62 65 44 60]
 [58 68 62 52 59 60 49 59 57 61 39 61 56 43 62 63 65 48 63 55 65 56 54 70
  62 35 59 64 47 57 55 64 70 51 58 60 77 35 70 59 64 57 56 48 56 66 54 69
  51 43 62 67 59 45 58 50 62 38 66 52 53 63 54 66 55 49 54 56 46 61 67 58
  47 52 58 57 58 61 42 52 59 40 61 46 59 57 57 55 61 58 58 67 44 63 63 59
  57 45 68 57 57]]


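Most of the time it is better to keep the data in a DataFrame: labels, mixed dtypes, and methods such as `quantile` stay available. Convert to NumPy only when you need the raw array.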
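
To illustrate, here is a small sketch with made-up values (not the heart data) that computes a per-group mean both ways. The DataFrame version addresses columns by name, while the NumPy version needs column positions and explicit masks.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only.
frame = pd.DataFrame({"sex": [1, 0, 1, 0], "age": [63, 41, 37, 57]})

# DataFrame: columns are addressed by name and grouping is one call.
mean_age_by_sex = frame.groupby("sex")["age"].mean()

# NumPy: the same numbers need positional columns and boolean masks.
array = frame.to_numpy()
mean_age_sex0 = array[array[:, 0] == 0, 1].mean()
mean_age_sex1 = array[array[:, 0] == 1, 1].mean()

print(mean_age_by_sex)
print(mean_age_sex0, mean_age_sex1)
```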

## Creating DataFrames

In [25]:
data = np.random.random(size = (5,3))
print(data)
df = pd.DataFrame(data = data, columns = ["A", "B","C"])
df

[[0.25219688 0.5293432  0.45144641]
 [0.56515903 0.71756664 0.21117433]
 [0.03329659 0.34016218 0.68356936]
 [0.3091201  0.94986377 0.49021003]
 [0.1056806  0.3951101  0.72983312]]


Unnamed: 0,A,B,C
0,0.252197,0.529343,0.451446
1,0.565159,0.717567,0.211174
2,0.033297,0.340162,0.683569
3,0.30912,0.949864,0.49021
4,0.105681,0.39511,0.729833


Create a DataFrame with a dictionary

In [26]:
df = pd.DataFrame({"A": [1,2,3], "B": ["Sam","Alex","John"]})
df

Unnamed: 0,A,B
0,1,Sam
1,2,Alex
2,3,John


In [30]:
data = [{"A": 1, "B": "Sam"},{"A": 2, "B": "Alex"}, {"A":3, "B":"John"}]
df = pd.DataFrame(data)
df

Unnamed: 0,A,B
0,1,Sam
1,2,Alex
2,3,John
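
The two dictionary-based constructors above describe the same table, once column by column and once row by row. The sketch below checks that on a two-row example and also shows what happens when a record lacks a key; the `ragged` frame is an illustration added here and is not part of the original cells.

```python
import pandas as pd

# Column-oriented: one dict whose values are the columns.
by_columns = pd.DataFrame({"A": [1, 2], "B": ["Sam", "Alex"]})

# Row-oriented: a list with one dict per row.
by_rows = pd.DataFrame([{"A": 1, "B": "Sam"}, {"A": 2, "B": "Alex"}])
same = by_columns.equals(by_rows)

# A record that lacks a key gets a missing value in that column.
ragged = pd.DataFrame([{"A": 1, "B": "Sam"}, {"A": 2}])

print(same)
print(ragged)
```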


Create a DataFrame with a numpy structure

In [28]:
dtype = [("A", int),("B", (str, 20))]  # builtin int and str replace np.int and np.str, which were deprecated in NumPy 1.20
data = np.array([(1, "Sam"), (2, "Alex"), (3,"John")], dtype = dtype)
df = pd.DataFrame(data)
df


Unnamed: 0,A,B
0,1,Sam
1,2,Alex
2,3,John


## Saving and Serializing a DataFrame

In [36]:
df = pd.DataFrame(np.random.random(size = (100000, 4)), columns = ["A", "B","C","D"])
df.head()

Unnamed: 0,A,B,C,D
0,0.801048,0.328091,0.386883,0.185116
1,0.279739,0.684165,0.180335,0.645072
2,0.64632,0.390665,0.639835,0.729768
3,0.097497,0.735112,0.411006,0.903781
4,0.624882,0.666971,0.509083,0.173215


In [None]:
df.to_csv("save.csv", index = False, float_format = "%0.4f")

In [None]:
df.to_pickle("save.pkl")

In [None]:
df.to_hdf("save.hdf", key = "data", format = "table")

In [None]:
df.to_feather("save.fth")
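
A minimal round-trip sketch for two of these formats, assuming a temporary directory and a smaller random frame than the one above. HDF and Feather are left out only because they need optional packages. It shows that the pickle restores the frame exactly, while the CSV written with `float_format="%0.4f"` keeps four decimal places.

```python
import os
import tempfile

import numpy as np
import pandas as pd

frame = pd.DataFrame(np.random.random(size=(100, 2)), columns=["A", "B"])

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "save.csv")
    pkl_path = os.path.join(tmp_dir, "save.pkl")

    frame.to_csv(csv_path, index=False, float_format="%0.4f")
    frame.to_pickle(pkl_path)

    from_csv = pd.read_csv(csv_path)
    from_pkl = pd.read_pickle(pkl_path)
    csv_bytes = os.path.getsize(csv_path)
    pkl_bytes = os.path.getsize(pkl_path)

# The pickle is an exact copy; the CSV differs by the rounding error.
pickle_exact = from_pkl.equals(frame)
csv_max_error = (from_csv - frame).abs().to_numpy().max()
print(pickle_exact, csv_max_error, csv_bytes, pkl_bytes)
```

Which format is smaller or faster depends on the data, so it is worth measuring on your own frame.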

## Inspecting Data

In [37]:
dataset = "C:\\Users\\pyria\\OneDrive\\Documents\\Personal Development\\\
Portfolio Projects\\Data Sets\\Astronaut Dataset\\astronauts.csv"

df = pd.read_csv(dataset)

In [38]:
df.head(2)

Unnamed: 0,Name,Year,Group,Status,Birth Date,Birth Place,Gender,Alma Mater,Undergraduate Major,Graduate Major,Military Rank,Military Branch,Space Flights,Space Flight (hr),Space Walks,Space Walks (hr),Missions,Death Date,Death Mission
0,Joseph M. Acaba,2004.0,19.0,Active,5/17/1967,"Inglewood, CA",Male,University of California-Santa Barbara; Univer...,Geology,Geology,,,2,3307,2,13.0,"STS-119 (Discovery), ISS-31/32 (Soyuz)",,
1,Loren W. Acton,,,Retired,3/7/1936,"Lewiston, MT",Male,Montana State University; University of Colorado,Engineering Physics,Solar Physics,,,1,190,0,0.0,STS 51-F (Challenger),,


In [39]:
df.tail()

Unnamed: 0,Name,Year,Group,Status,Birth Date,Birth Place,Gender,Alma Mater,Undergraduate Major,Graduate Major,Military Rank,Military Branch,Space Flights,Space Flight (hr),Space Walks,Space Walks (hr),Missions,Death Date,Death Mission
352,David A. Wolf,1990.0,13.0,Retired,8/23/1956,"Indianapolis, IN",Male,Purdue University; Indiana University,Electrical Engineering,Medicine,,,3,4044,7,41.0,STS-58 (Columbia). STS-86/89 (Atlantis/Endeavo...,,
353,Neil W. Woodward III,1998.0,17.0,Retired,7/26/1962,"Chicago, IL",Male,MIT; University of Texas-Austin; George Washin...,Physics,Physics; Business Management,Commander,US Navy,0,0,0,0.0,,,
354,Alfred M. Worden,1966.0,5.0,Retired,2/7/1932,"Jackson, MI",Male,US Military Academy; University of Michigan,Military Science,Aeronautical & Astronautical Engineering,Colonel,US Air Force (Retired),1,295,1,0.5,Apollo 15,,
355,John W. Young,1962.0,2.0,Retired,9/24/1930,"San Francisco, CA",Male,Georgia Institute of Technology,Aeronautical Engineering,,Captain,US Navy (Retired),6,835,3,20.0,"Gemini 3, Gemini 10, Apollo 10, Apollo 16, STS...",,
356,George D. Zamka,1998.0,17.0,Retired,6/29/1962,"Jersey City, NJ",Male,US Naval Academy; Florida Institute of Technology,Mathematics,Engineering Management,Colonel,US Marine Corps (Retired),2,692,0,0.0,"STS-120 (Discovery), STS-130 (Endeavor)",,


In [40]:
df.sample(3)

Unnamed: 0,Name,Year,Group,Status,Birth Date,Birth Place,Gender,Alma Mater,Undergraduate Major,Graduate Major,Military Rank,Military Branch,Space Flights,Space Flight (hr),Space Walks,Space Walks (hr),Missions,Death Date,Death Mission
22,Daniel T. Barry,1992.0,14.0,Retired,12/30/1953,"Norwalk, CT",Male,Cornell University; Princeton University; Univ...,Electrical Engineering,Electrical Engineering; Computer Science; Medi...,,,3,733,4,26.0,"STS-72 (Endeavor), STS-96 (Discovery), STS-105...",,
258,James A. Pawelczyk,,,Retired,9/20/1960,"Buffalo, NY",Male,University of Rochester; Pennsylvania State Un...,Biology & Psychology,Physiology; Biology,,,1,381,0,0.0,STS-90 (Columbia),,
161,David C. Hilmers,1980.0,9.0,Retired,1/28/1950,"Clinton, IA",Male,Cornell University; US Naval Postgraduate School,Mathematics,Electrical Engineering,Colonel,US Marine Corps (Retired),4,494,0,0.0,"ST 51-J (Atlantis), STS-26 (Discovery), STS-36...",,


In [41]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 357 entries, 0 to 356
Data columns (total 19 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Name                 357 non-null    object 
 1   Year                 330 non-null    float64
 2   Group                330 non-null    float64
 3   Status               357 non-null    object 
 4   Birth Date           357 non-null    object 
 5   Birth Place          357 non-null    object 
 6   Gender               357 non-null    object 
 7   Alma Mater           356 non-null    object 
 8   Undergraduate Major  335 non-null    object 
 9   Graduate Major       298 non-null    object 
 10  Military Rank        207 non-null    object 
 11  Military Branch      211 non-null    object 
 12  Space Flights        357 non-null    int64  
 13  Space Flight (hr)    357 non-null    int64  
 14  Space Walks          357 non-null    int64  
 15  Space Walks (hr)     357 non-null    flo

In [43]:
df.describe()

Unnamed: 0,Year,Group,Space Flights,Space Flight (hr),Space Walks,Space Walks (hr)
count,330.0,330.0,357.0,357.0,357.0,357.0
mean,1985.106061,11.409091,2.364146,1249.266106,1.246499,7.707283
std,13.216147,5.149962,1.4287,1896.759857,2.056989,13.367973
min,1959.0,1.0,0.0,0.0,0.0,0.0
25%,1978.0,8.0,1.0,289.0,0.0,0.0
50%,1987.0,12.0,2.0,590.0,0.0,0.0
75%,1996.0,16.0,3.0,1045.0,2.0,12.0
max,2009.0,20.0,7.0,12818.0,10.0,67.0


In [45]:
df.shape

(357, 19)

In [46]:
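df.corr(numeric_only = True)  # needed on pandas 2.x, where text columns raise an error instead of being skipped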

Unnamed: 0,Year,Group,Space Flights,Space Flight (hr),Space Walks,Space Walks (hr)
Year,1.0,0.980934,0.03642,0.331386,0.210073,0.253502
Group,0.980934,1.0,-0.011386,0.325683,0.217891,0.261384
Space Flights,0.03642,-0.011386,1.0,0.325233,0.257073,0.258642
Space Flight (hr),0.331386,0.325683,0.325233,1.0,0.472796,0.454408
Space Walks,0.210073,0.217891,0.257073,0.472796,1.0,0.985755
Space Walks (hr),0.253502,0.261384,0.258642,0.454408,0.985755,1.0


`value_counts` is normally called on one column (a Series) at a time.

In [None]:
df["Year"].value_counts()

1978.0    35
1996.0    35
1998.0    25
1990.0    23
1966.0    19
1995.0    19
1980.0    19
1992.0    19
1984.0    18
2000.0    17
1987.0    15
1963.0    14
1985.0    13
2004.0    11
1967.0    11
2009.0     9
1962.0     8
1969.0     7
1959.0     7
1965.0     6
Name: Year, dtype: int64
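
The `Year` column has missing entries (330 non-null values out of 357 in `df.info()`), and `value_counts` leaves them out by default. The sketch below uses a short made-up Series to show the `dropna` and `normalize` parameters.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for a column with a missing entry.
years = pd.Series([1978.0, 1996.0, 1978.0, np.nan], name="Year")

counts = years.value_counts()                           # NaN is dropped by default
counts_with_missing = years.value_counts(dropna=False)  # NaN gets its own row
shares = years.value_counts(normalize=True)             # fractions instead of counts

print(counts)
print(counts_with_missing)
print(shares)
```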

In [48]:
df.max()


Name                 Yvonne D. Cagle
Year                          2009.0
Group                           20.0
Status                       Retired
Birth Date                  9/9/1952
Birth Place              Yonkers, NY
Gender                          Male
Space Flights                      7
Space Flight (hr)              12818
Space Walks                       10
Space Walks (hr)                67.0
dtype: object

In [49]:
df["Year"].max()

2009.0
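
On a frame that mixes text and numbers, a frame-wide `max` compares strings alphabetically, which is rarely what you want. A hedged sketch with a small made-up frame (the names and values are placeholders, not rows from the astronaut file) shows how `numeric_only=True` restricts the reduction to numeric columns.

```python
import pandas as pd

# Made-up rows for illustration only.
frame = pd.DataFrame({
    "Name": ["Person A", "Person B"],
    "Year": [2004.0, 1962.0],
    "Space Walks": [2, 3],
})

numeric_max = frame.max(numeric_only=True)  # skips the text column
year_max = frame["Year"].max()              # a single column gives a scalar

print(numeric_max)
print(year_max)
```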

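### Summary

- `df.head()`: the first rows
- `df.tail()`: the last rows
- `df.sample()`: a random selection of rows
- `df.info()`: column dtypes and non-null counts
- `df.describe()`: summary statistics for the numeric columns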