# How to create a sample dataframe from scratch

In this notebook, we will explore how to create a sample dataframe manually, to be used in other examples.

## Create from a list of lists
We will build this dataframe row by row from a list of lists: each inner list is one row, and the columns argument supplies the column names.

In [3]:
import pandas as pd

data = [
    ['James', 50, 'Web Developer', 'james.heatfilled@gmail.com'],
    ['Astrid', 41, 'Data Analyst', 'astridleiland@hotmail.com.com'],
    ['Louise', 27, 'Cloud Architect', 'louiselane@supercloud.com'],
    ['Shawn', 36, 'Senior Flow Controller', 'shawnlane@supercloud.com'],
]

df = pd.DataFrame(data, columns=['Name', 'Age', 'Position', 'email'])

In [4]:
df.head()


,Name,Age,Position,email
0,James,50,Web Developer,james.heatfilled@gmail.com
1,Astrid,41,Data Analyst,astridleiland@hotmail.com.com
2,Louise,27,Cloud Architect,louiselane@supercloud.com
3,Shawn,36,Senior Flow Controller,shawnlane@supercloud.com
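As a point of comparison, the same row-by-row idea can be written with a list of dictionaries, where each row carries its own column names. This is only a minimal sketch on a shortened version of the data above; the variable names `rows`, `people` and `from_lists` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Each dictionary is one row; the keys become the column names.
rows = [
    {"Name": "James", "Age": 50, "Position": "Web Developer"},
    {"Name": "Astrid", "Age": 41, "Position": "Data Analyst"},
]
people = pd.DataFrame(rows)

# The equivalent list-of-lists construction, with explicit column names.
from_lists = pd.DataFrame(
    [["James", 50, "Web Developer"], ["Astrid", 41, "Data Analyst"]],
    columns=["Name", "Age", "Position"],
)

# Both constructions should yield the same dataframe.
print(people.equals(from_lists))
```

The list-of-lists form is more compact, while the list-of-dictionaries form makes it harder to mix up the column order.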


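## Create from a dictionary of lists

Each dictionary key becomes a column name, and each list holds the values of that column. All lists must have the same length.
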
In [11]:
temperature_data = {
    "station_id": [1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4],
    "date": ['2022-07-13', '2022-07-13', '2022-07-13', '2022-07-13',
             '2022-07-14', '2022-07-14', '2022-07-14', '2022-07-14',
             '2022-07-15', '2022-07-15', '2022-07-15', '2022-07-15'],
    "temperature": [36.5, 37.9, 34.3, 40.2, 38.2, 39.8, 35.2, 41.7, 34.1, 37.2, 30.9, 38.6],
}

temperatures = pd.DataFrame(temperature_data)
temperatures

,station_id,date,temperature
0,1,2022-07-13,36.5
1,2,2022-07-13,37.9
2,3,2022-07-13,34.3
3,4,2022-07-13,40.2
4,1,2022-07-14,38.2
5,2,2022-07-14,39.8
6,3,2022-07-14,35.2
7,4,2022-07-14,41.7
8,1,2022-07-15,34.1
9,2,2022-07-15,37.2
10,3,2022-07-15,30.9
11,4,2022-07-15,38.6


In [12]:
temperatures.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12 entries, 0 to 11
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   station_id   12 non-null     int64  
 1   date         12 non-null     object 
 2   temperature  12 non-null     float64
dtypes: float64(1), int64(1), object(1)
memory usage: 416.0+ bytes


In [14]:
# Converting date column to datetime dtype
temperatures['date'] = pd.to_datetime(temperatures['date'])
temperatures.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12 entries, 0 to 11
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   station_id   12 non-null     int64         
 1   date         12 non-null     datetime64[ns]
 2   temperature  12 non-null     float64       
dtypes: datetime64[ns](1), float64(1), int64(1)
memory usage: 416.0 bytes
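
Once the column has a datetime dtype, date-aware operations become available. The sketch below illustrates a few of them on a smaller, made-up subset of the temperature data. The names `temps`, `later` and `daily_mean` are illustrative, and the explicit `format` argument is an optional addition that the original cell does not use.

```python
import pandas as pd

temps = pd.DataFrame({
    "station_id": [1, 2, 1, 2],
    "date": ["2022-07-13", "2022-07-13", "2022-07-14", "2022-07-14"],
    "temperature": [36.5, 37.9, 38.2, 39.8],
})

# Parse the strings into datetimes; an explicit format avoids guessing.
temps["date"] = pd.to_datetime(temps["date"], format="%Y-%m-%d")

# The .dt accessor exposes date components, such as the weekday name.
temps["weekday"] = temps["date"].dt.day_name()

# Datetime columns can be compared against timestamps for filtering.
later = temps[temps["date"] >= pd.Timestamp("2022-07-14")]

# Grouping by date gives one mean temperature per day.
daily_mean = temps.groupby("date")["temperature"].mean()
print(daily_mean)
```

None of these operations behave this way while the column is still of object dtype, which is why the conversion is worth doing right after creating the dataframe.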


## Create a dataframe that includes NaNs

Let's say we want to replace about 15% of the values in a dataframe with NaN. To do so, we can iterate over the columns and, for each one, use the sample method to pick that fraction of rows to overwrite.

In [22]:
import numpy as np
nan_df = pd.DataFrame(np.random.randn(7, 5))

for col in nan_df.columns:
    nan_df.loc[nan_df.sample(frac=0.15).index, col] = np.nan

nan_df

,0,1,2,3,4
0,0.555998,,-0.875981,-0.261545,-0.491278
1,-1.452487,1.829177,,0.401964,
2,-0.44586,0.573537,0.279726,-1.263177,-0.919389
3,-0.248787,0.104669,1.166173,,0.728698
4,,0.355799,1.282254,1.269502,-0.797775
5,1.792932,0.388397,2.031881,-0.372052,0.118529
6,-0.606921,1.52951,-0.998647,1.361117,1.174833


Reference: https://stackoverflow.com/questions/39059032/randomly-insert-nas-values-in-a-pandas-dataframe
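
The loop above picks a whole number of rows per column (with 7 rows and frac=0.15, that should be one row per column), and its result changes on every run. As an alternative, here is a minimal sketch that uses a seeded random boolean mask over the whole dataframe. The seed value and the names `rng`, `nan_mask` and `masked_df` are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Seeded generator so the example is reproducible (the seed is arbitrary).
rng = np.random.default_rng(42)
nan_df = pd.DataFrame(rng.standard_normal((7, 5)))

# Boolean array with the same shape as the dataframe; each cell is True
# with probability 0.15, independently of the others.
nan_mask = rng.random(nan_df.shape) < 0.15

# DataFrame.mask replaces values with NaN wherever the condition is True.
masked_df = nan_df.mask(nan_mask)

# Share of NaN cells: 0.15 on average, but not exactly 0.15 on a small frame.
nan_fraction = masked_df.isna().to_numpy().mean()
print(nan_fraction)
```

With this approach the number of NaNs per column varies, so it fits cases where the missing values should be scattered across the whole dataframe rather than evenly spread over columns.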

## Warning on modifying and copying dataframes

In [5]:
# Use df2 = df.copy() instead of df2 = df.
# Plain assignment does not copy anything: df2 and df would be two names for
# the same dataframe in memory, so modifying one would also modify the other.

df2 = df.copy()
# Without inplace=True, fillna returns a new dataframe and leaves df2 unchanged.
df2.fillna('0', inplace=True)
df2

# Reference: https://stackoverflow.com/questions/27673231/why-should-i-make-a-copy-of-a-data-frame-in-pandas

,Name,Age,Position,email
0,James,50,Web Developer,james.heatfilled@gmail.com
1,Astrid,41,Data Analyst,astridleiland@hotmail.com.com
2,Louise,27,Cloud Architect,louiselane@supercloud.com
3,Shawn,36,Senior Flow Controller,shawnlane@supercloud.com
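
To make the difference between assignment and copy() concrete, the following minimal sketch modifies a dataframe through an alias and through a copy. The small dataframe and the names `original`, `alias` and `independent` are made up for illustration.

```python
import pandas as pd

original = pd.DataFrame({"Name": ["James", "Astrid"], "Age": [50, 41]})

alias = original               # no copy: both names refer to the same object
independent = original.copy()  # separate dataframe with its own data

# Writing through the alias also changes the original.
alias.loc[0, "Age"] = 99

# Writing to the copy leaves the original untouched.
independent.loc[1, "Age"] = 0

print(alias is original)        # True
print(original.loc[0, "Age"])   # 99
print(original.loc[1, "Age"])   # 41
```

The identity check with `is` is a quick way to confirm whether two variables point to the same dataframe.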
