https://www.gormanalysis.com/blog/python-pandas-for-your-grandpa-3-1-dataframe-creation/

In this section, we’ll look at different ways to create a DataFrame from scratch.

Perhaps the easiest way to make a DataFrame from scratch is to call the DataFrame() constructor with a dictionary of ‘column-name: column-values’ pairs. For example, here we build a DataFrame with two columns, ‘name’ and ‘age’, and for each column we pass in a three-element list of values.

In [1]:
import numpy as np
import pandas as pd

df = pd.DataFrame({'name': ['Bob', 'Sue', 'Mary'], 'age': [39, 57, 28]})
print(df)
##    name  age
## 0   Bob   39
## 1   Sue   57
## 2  Mary   28


Let’s pause for a second to talk about what exactly a DataFrame is. In short, a DataFrame is just a table of data with a row index. In this case, the row index is that unlabeled column of values on the far left. To be a little more pedantic, a DataFrame is a collection of identically-sized Series, all of which share the same index. Additionally, DataFrames have a column index for selecting and subsetting columns. We’ll touch on that more later.
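
To make the "collection of Series sharing one index" idea concrete, here is a small sketch using the same df as above. The variable name ages is just for illustration.

```python
import pandas as pd

df = pd.DataFrame({'name': ['Bob', 'Sue', 'Mary'], 'age': [39, 57, 28]})

# Selecting a single column gives back a Series
ages = df['age']
print(isinstance(ages, pd.Series))

# Every column Series carries the same row index as the DataFrame itself
print(ages.index.equals(df.index))
print(df['name'].index.equals(df['age'].index))
```

All three checks should print True.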

Another way to build a DataFrame is from a list of lists. In this case, each inner list represents a row, and the column names are passed separately via the columns argument, so you could build the same DataFrame as before using:

In [2]:
df = pd.DataFrame([
    ['Bob', 39],
    ['Sue', 57],
    ['Mary', 28]
], columns=['name', 'age'])
print(df)
##    name  age
## 0   Bob   39
## 1   Sue   57
## 2  Mary   28
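
If you want to convince yourself that the dictionary form and the list-of-lists form really produce the same table, one way is to compare them with .equals(). This is only a quick check; the names from_dict and from_rows are made up for the example.

```python
import pandas as pd

# Column-wise construction: one list of values per column
from_dict = pd.DataFrame({'name': ['Bob', 'Sue', 'Mary'], 'age': [39, 57, 28]})

# Row-wise construction: one inner list per row, column names given separately
from_rows = pd.DataFrame([
    ['Bob', 39],
    ['Sue', 57],
    ['Mary', 28]
], columns=['name', 'age'])

# .equals() compares shape, values, and column dtypes
print(from_dict.equals(from_rows))
```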


Before we move on, let’s touch on a few important tools for inspecting DataFrames. df.info() reports most of what you’d want to know about a DataFrame at a glance, including its number of rows, index type, column names, non-null counts, column types, and memory usage.

In [4]:
df.info()
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 3 entries, 0 to 2
## Data columns (total 2 columns):
##  #   Column  Non-Null Count  Dtype 
## ---  ------  --------------  ----- 
##  0   name    3 non-null      object
##  1   age     3 non-null      int64 
## dtypes: int64(1), object(1)
## memory usage: 176.0+ bytes
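
df.info() prints its report rather than returning it, so if you need one of those pieces as a value, you can ask for it directly. A rough sketch of some of the individual lookups follows; the exact dtype shown for ‘name’ and the byte counts may differ between pandas versions, so no outputs are listed here.

```python
import pandas as pd

df = pd.DataFrame({'name': ['Bob', 'Sue', 'Mary'], 'age': [39, 57, 28]})

print(len(df))            # number of rows
print(df.index)           # the row index (a RangeIndex here)
print(df.dtypes)          # dtype of each column
print(df.memory_usage())  # bytes used by the index and by each column
```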


df.shape tells you how many rows and columns df has as a (rows, columns) tuple, just like the .shape attribute of a 2-D NumPy array.

In [5]:
df.shape
## (3, 2)
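
Since the shape is an ordinary tuple, you can unpack it into separate row and column counts. The sketch below also converts df to a NumPy array with .to_numpy() to show that the two shapes agree; n_rows, n_cols, and arr are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'name': ['Bob', 'Sue', 'Mary'], 'age': [39, 57, 28]})

# Unpack the (rows, columns) tuple
n_rows, n_cols = df.shape
print(n_rows, n_cols)

# The equivalent 2-D NumPy array reports the same shape
arr = df.to_numpy()
print(arr.shape == df.shape)
```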

df.axes returns a list containing the row index and the column index, in that order.

In [None]:
df.axes
## [RangeIndex(start=0, stop=3, step=1), Index(['name', 'age'], dtype='object')]
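
The two entries of that list are the same indexes you get from df.index and df.columns. A minimal sketch, with row_index and col_index as made-up names:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Bob', 'Sue', 'Mary'], 'age': [39, 57, 28]})

# df.axes is a two-element list: [row index, column index]
row_index, col_index = df.axes

print(row_index.equals(df.index))    # rows
print(col_index.equals(df.columns))  # columns
```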

And df.size tells you the total number of elements in the DataFrame, that is, rows times columns.

In [6]:
df.size
## 6
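
As a quick sanity check, the size should always equal the product of the two numbers in the shape. For our 3-row, 2-column df:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Bob', 'Sue', 'Mary'], 'age': [39, 57, 28]})

n_rows, n_cols = df.shape

# size counts every cell, so it equals rows * columns
print(df.size == n_rows * n_cols)
```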

Because this question comes up so frequently, I’ll deal with it here. To change the column names of a DataFrame, use the .rename() method and pass its columns argument a dictionary of ‘old-name: new-name’ pairs. By default, .rename() leaves the DataFrame you’re working with untouched and returns a new, modified copy, so here we set inplace=True to modify df directly.

So in this case, if we want to change the column name ‘age’ to ‘years’, we would do:

In [7]:
df.rename(columns={'age':'years'}, inplace=True)
print(df)
##    name  years
## 0   Bob     39
## 1   Sue     57
## 2  Mary     28
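
For comparison, here is a sketch of what happens when you leave inplace at its default. The original DataFrame keeps its column names, and the renamed copy is whatever you assign the result to (renamed is just an illustrative name). Many people prefer this style and simply write df = df.rename(...).

```python
import pandas as pd

df = pd.DataFrame({'name': ['Bob', 'Sue', 'Mary'], 'age': [39, 57, 28]})

# Without inplace=True, rename() returns a new DataFrame
renamed = df.rename(columns={'age': 'years'})

print(list(df.columns))       # original is unchanged
## ['name', 'age']
print(list(renamed.columns))  # the copy has the new name
## ['name', 'years']
```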

