#Create Pandas DataFrame Tutorial

In [0]:
import pandas as pd



In [0]:
technologies = [["Spark", 20000, "30days"],
               ["pandas", 25000, "40days"]]

df = pd.DataFrame(technologies)
print(df)

        0      1       2
0   Spark  20000  30days
1  pandas  25000  40days


**Since we have not given index and column labels, DataFrame by default assigns incremental sequence numbers as labels to both rows and columns.**

**Sequence numbers make poor column names because they say nothing about the data each column holds, so it is best practice to provide descriptive column names. Use the `columns` param and the `index` param to supply column labels and a custom row index, respectively.**

In [0]:
# Add Column & Row Labels to the DataFrame
column_names = ["Courses", "Fee", "Duration"]
row_label = ["a", "b"]
df = pd.DataFrame(technologies, index=row_label, columns=column_names)
print(df)

  Courses    Fee Duration
a   Spark  20000   30days
b  pandas  25000   40days


##By default, pandas infers the data type of each column from the data. df.dtypes returns the data type of each column.

In [0]:
df.dtypes

Out[4]: Courses     object
Fee          int64
Duration    object
dtype: object

##You can also assign custom data types to columns.

In [0]:
# set custom types to DataFrame
types = {'Courses':str, 'Fee':float, 'Duration':str}
df = df.astype(types)

In [0]:
df.dtypes

Out[6]: Courses      object
Fee         float64
Duration     object
dtype: object
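As a small variation on the cell above, `astype` also accepts a mapping that names only the columns to convert and leaves the others untouched. The sketch below rebuilds the sample data so that it runs on its own; the names `sample` and `converted` are illustrative and do not appear in the original notebook.

```python
import pandas as pd

# Rebuild the sample data used above so this cell is self-contained.
sample = pd.DataFrame(
    [["Spark", 20000, "30days"], ["pandas", 25000, "40days"]],
    columns=["Courses", "Fee", "Duration"],
)

# Convert only the Fee column; astype returns a new DataFrame.
converted = sample.astype({"Fee": "float64"})

print(converted.dtypes["Fee"])  # float64
print(sample.dtypes["Fee"])     # int64, the original is unchanged
```

Because `astype` returns a new object, the result has to be assigned back (as in `df = df.astype(types)` above) for the change to persist.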

In [0]:
# Create DataFrame from Dictionary

technologies = {
    'Courses': ["Spark", "PySpark", "Hadoop"],
    'Fee': [20000, 25000, 26000],
    'Duration':["30days", "40days", "35days"],
    'Discount':[1000, 2300, 1500]
}

df = pd.DataFrame(technologies)

In [0]:
df

   Courses    Fee Duration  Discount
0    Spark  20000   30days      1000
1  PySpark  25000   40days      2300
2   Hadoop  26000   35days      1500


##Create DataFrame with Index

---


**By default, a DataFrame gets a numeric index starting from zero. You can replace it with a custom index when creating the DataFrame.**

In [0]:

# Create DataFrame with Index.
technologies = {
    'Courses':["Spark","Pandas"],
    'Fee' :[20000,25000],
    'Duration':['30days','40days']
}
index_label=["r1","r2"]

df = pd.DataFrame(technologies, index=index_label)
print(df)

   Courses    Fee Duration
r1   Spark  20000   30days
r2  Pandas  25000   40days
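One practical benefit of a custom index is label-based selection with `.loc`. The following is a minimal sketch that rebuilds the same data; the variable names `df_indexed`, `row`, and `fee` are assumptions made for illustration.

```python
import pandas as pd

# Same data as above, rebuilt so this cell is self-contained.
df_indexed = pd.DataFrame(
    {
        "Courses": ["Spark", "Pandas"],
        "Fee": [20000, 25000],
        "Duration": ["30days", "40days"],
    },
    index=["r1", "r2"],
)

# Select a whole row by its index label; the result is a Series.
row = df_indexed.loc["r2"]
print(row["Courses"])  # Pandas

# Select a single cell by row label and column label.
fee = df_indexed.loc["r1", "Fee"]
print(fee)  # 20000
```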


##Creating DataFrame from a list of dicts


**Sometimes data arrives as JSON-like records (a list of dicts). You can convert it to a DataFrame as shown below.**

In [0]:

# Creates DataFrame from list of dict
technologies = [{'Courses':'Spark', 'Fee': 20000, 'Duration':'30days'},
        {'Courses':'Pandas', 'Fee': 25000, 'Duration': '40days'}]

df = pd.DataFrame(technologies)
print(df)


  Courses    Fee Duration
0   Spark  20000   30days
1  Pandas  25000   40days
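When the records really do arrive as a JSON string rather than as Python objects, one option is to parse the string with the standard-library `json` module first and then pass the resulting list of dicts to `pd.DataFrame`. This is a sketch only; the payload in `json_text` is made up for illustration.

```python
import json

import pandas as pd

# Hypothetical JSON payload, for illustration only.
json_text = (
    '[{"Courses": "Spark", "Fee": 20000, "Duration": "30days"},'
    ' {"Courses": "Pandas", "Fee": 25000, "Duration": "40days"}]'
)

# json.loads turns the string into a list of dicts.
records = json.loads(json_text)

# Each dict becomes one row; the dict keys become the column labels.
df_json = pd.DataFrame(records)
print(df_json)
```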


##Creating DataFrame From Series

---

**You can create a DataFrame from multiple Series with the concat() function. It takes several params; here we pass a list of the Series to combine and axis=1 so that each Series becomes a column instead of being stacked as rows.**

In [0]:
# Create pandas Series
courses = pd.Series(["Spark","Pandas"])
fees = pd.Series([20000,25000])
duration = pd.Series(['30days','40days'])

# Create DataFrame from series objects.
df = pd.concat([courses, fees, duration], axis=1)
print(df)

        0      1       2
0   Spark  20000  30days
1  Pandas  25000  40days


##Add Column Labels

---

**As the output above shows, concat() falls back to sequence numbers as column labels because the Series have no names. Passing a dict to concat() uses its keys as the column labels, as below.**

In [0]:

# Assign Index to Series
index_labels = ['r1', 'r2']
courses.index = index_labels
fees.index = index_labels
duration.index = index_labels

# Concat Series by Changing Names
df = pd.concat({'Courses':courses,
               'Courses_Fee':fees,
               'Courses_duration':duration}, axis=1)
print(df)

   Courses  Courses_Fee Courses_duration
r1   Spark        20000           30days
r2  Pandas        25000           40days
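Another way to get column labels, sketched here as an alternative to the dict form, is to give each Series a `name` when it is created; with `axis=1`, `concat()` then uses those names as the column labels. The variable names below are illustrative.

```python
import pandas as pd

# Name each Series up front.
courses_s = pd.Series(["Spark", "Pandas"], name="Courses")
fees_s = pd.Series([20000, 25000], name="Fee")
duration_s = pd.Series(["30days", "40days"], name="Duration")

# With axis=1, the Series names become the column labels.
df_named = pd.concat([courses_s, fees_s, duration_s], axis=1)
print(df_named.columns.tolist())  # ['Courses', 'Fee', 'Duration']
```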


##Creating DataFrame using zip() function

---

**Multiple lists can be combined row by row with Python's built-in zip() function, and the resulting list of tuples can be used to create a DataFrame.**

In [0]:
# Create Lists
Courses = ['Spark', 'Pandas']
Fee = [20000, 25000]
Duration = ['30days', '40days']

#Merge lists by using zip()
tuple_list = list(zip(Courses, Fee, Duration))
df = pd.DataFrame(tuple_list, columns=['Courses', 'Fee', 'Duration'])
print(df)

  Courses    Fee Duration
0   Spark  20000   30days
1  Pandas  25000   40days
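`zip()` can also pair column names with their value lists instead of pairing values row by row. The sketch below builds a dict that way and passes it to `pd.DataFrame`; `column_names`, `column_values`, and `df_zip` are names chosen for this example.

```python
import pandas as pd

column_names = ["Courses", "Fee", "Duration"]
column_values = [
    ["Spark", "Pandas"],
    [20000, 25000],
    ["30days", "40days"],
]

# Pair each column name with its list of values, then build the DataFrame.
df_zip = pd.DataFrame(dict(zip(column_names, column_values)))
print(df_zip)
```

This form avoids passing `columns=` separately, at the cost of needing the data already organised by column.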


##Create an empty DataFrame in pandas


---


**Sometimes you need to create an empty pandas DataFrame, with or without columns. One common case is described below.**

**When processing files, an expected file may occasionally not arrive, yet we still need a DataFrame with the column names we expect. Without those columns, later operations and transformations (such as unions) fail because they refer to columns that are not present.**

**To handle such situations, always create the DataFrame with the expected columns, meaning the same column names and data types, whether the file exists or is empty.**


In [0]:
# Create Empty DataFrame

df = pd.DataFrame()

print(df)

Empty DataFrame
Columns: []
Index: []


##Create an empty DataFrame with column names but no data

In [0]:
# Create Empty DataFrame with Column Labels
df = pd.DataFrame(columns=['Courses', 'Fee', 'Duration'])
print(df)

Empty DataFrame
Columns: [Courses, Fee, Duration]
Index: []
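The text above asks for the same column names and data types, but `columns=` alone sets only the names. One possible way to fix the dtypes as well is to build the frame from empty, typed Series. This is a sketch under an assumed schema; `expected_schema` and `empty_df` are hypothetical names.

```python
import pandas as pd

# Assumed schema for illustration: column name -> dtype.
expected_schema = {"Courses": "object", "Fee": "int64", "Duration": "object"}

# Build one empty Series per column, each with the dtype we expect.
empty_df = pd.DataFrame(
    {name: pd.Series(dtype=dtype) for name, dtype in expected_schema.items()}
)

print(len(empty_df))   # 0
print(empty_df.dtypes)
```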


##Create From Another DataFrame

**You can also create a DataFrame from another DataFrame by using the copy() method.**

In [0]:
# Copy Dataframe to another
df2 = df.copy()
print(df2)

Empty DataFrame
Columns: [Courses, Fee, Duration]
Index: []
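Since the copied frame above is empty, the effect of `copy()` is easier to see with some data. The sketch below contrasts plain assignment, which only creates a second name for the same object, with `copy()`, which by default makes an independent deep copy. The names `original`, `alias`, and `duplicate` are illustrative.

```python
import pandas as pd

original = pd.DataFrame({"Courses": ["Spark", "Pandas"], "Fee": [20000, 25000]})

alias = original             # same object, nothing is copied
duplicate = original.copy()  # independent copy (deep=True is the default)

# Changing the copy does not affect the original.
duplicate.loc[0, "Fee"] = 0

print(original.loc[0, "Fee"])   # 20000
print(duplicate.loc[0, "Fee"])  # 0
print(alias is original)        # True
```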
