# Creating DataFrames

In [1]:
import pandas as pd

This is the `DataFrame()` constructor. Its keyword arguments are `data`, `index`, `columns`, `dtype`, and `copy`:

In [26]:
pd.DataFrame()

`data` - can be a `NumPy` array, an iterable object, a dictionary, or another `DataFrame`. If a dictionary is passed in, it can contain `Series` objects, arrays, or list-like objects.

`index` - an `Index` object or another array-like object to use for the row labels. If none is passed in, it defaults to a `RangeIndex`, so the rows are labeled with integers starting at `0`.

`columns` - the column labels. The names that you pass in determine the order of the columns. Again, the default is a `RangeIndex` if you don’t pass in column names and the data doesn’t supply any (as dictionary keys do).

`dtype` - the data type to enforce for the whole `DataFrame`. Only a single data type is allowed here. If you leave it out, pandas infers the types from the data.

`copy` - if you pass in a `DataFrame` or a `NumPy` array as your data, the `DataFrame` that you’re constructing can share its data with that object, so changes that you make to one can show up in the other. The default behavior depends on the type of data and on your pandas version. To construct a data-independent `DataFrame`, set `copy` to `True`.
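
To make the defaults described above concrete, here is a minimal sketch. The array contents and the names `df_default`, `df_typed`, `r1`, `r2`, `a`, `b`, and `c` are made up for illustration and are not part of the examples that follow.

```python
import numpy as np
import pandas as pd

# A small 2x3 array; no index or column labels are supplied.
arr = np.array([[1, 2, 3], [4, 5, 6]])

df_default = pd.DataFrame(arr)
print(df_default.index)    # default row labels: a RangeIndex
print(df_default.columns)  # default column labels: also a RangeIndex

# Explicit labels plus one enforced data type for every column.
df_typed = pd.DataFrame(arr, index=['r1', 'r2'],
                        columns=['a', 'b', 'c'], dtype='float64')
print(df_typed.dtypes)
```

Because `dtype` takes a single type, every column of `df_typed` ends up as `float64`.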

In [4]:
import numpy as np

In [5]:
d = {'x': [1,2,3], 'y': np.array([2,4,8]), 'z': 100}

In [6]:
pd.DataFrame(d)

Unnamed: 0,x,y,z
0,1,2,100
1,2,4,100
2,3,8,100


If we don’t pass in any values for the keyword arguments `index` and `columns`, the keys of the dictionary are going to act as the labels for the columns and then the default indices `0`, `1`, `2` will be used for the row labels.

In [7]:
pd.DataFrame(d, index=[100, 200, 300], columns=['z', 'y', 'x'])

Unnamed: 0,z,y,x
100,100,2,1
200,100,4,2
300,100,8,3


Notice that for the `z` column, the value of `100` was repeated for every cell in that column.

Let’s suppose, instead, that the data you get is a list of dictionaries, where each dictionary contains the same keys:

In [8]:
lst = [{"x": 1, "y": 2, "z": 100},
       {"x": 2, "y": 4, "z": 100},
       {"x": 3, "y": 8, "z": 100}]

pd.DataFrame(lst)


Unnamed: 0,x,y,z
0,1,2,100
1,2,4,100
2,3,8,100


If we wanted the index to be labeled with the letters `'a'`, `'b'`, and `'c'`, we would just pass those in:

In [9]:
pd.DataFrame(lst, index=['a', 'b', 'c'])

Unnamed: 0,x,y,z
a,1,2,100
b,2,4,100
c,3,8,100
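
The list above assumes that every dictionary has the same keys. As a rough sketch of what happens when they don’t, the hypothetical `records` list below leaves one key out of each dictionary. Pandas fills the gaps with `NaN`.

```python
import pandas as pd

# Hypothetical records: each dictionary is missing one of the keys x, y, z.
records = [{"x": 1, "y": 2},
           {"x": 2, "z": 100},
           {"y": 8, "z": 100}]

df_gaps = pd.DataFrame(records, index=['a', 'b', 'c'])
print(df_gaps)

# Count the cells that pandas had to fill in as missing.
n_missing = int(df_gaps.isna().sum().sum())
print(n_missing)  # 3
```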


You can also pass in a nested list, where each inner list holds the data for one row:

In [10]:
lst2 = [[1, 2, 100],
        [2, 4, 100],
        [3, 8, 100]]

pd.DataFrame(lst2)

Unnamed: 0,0,1,2
0,1,2,100
1,2,4,100
2,3,8,100


We see that the first row, `1, 2, 100`, is the data that we passed in as the first list, and so on. You can give the columns more descriptive labels by passing in `columns=['x', 'y', 'z']`. And again, you can pass in `index` if you don’t want the default labels, which run from `0` to one less than the number of rows.

In [11]:
pd.DataFrame(lst2, columns=['x', 'y', 'z'])

Unnamed: 0,x,y,z
0,1,2,100
1,2,4,100
2,3,8,100
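
As a quick sketch of combining both keyword arguments with a nested list, the example below reuses `lst2` and labels the rows `'a'`, `'b'`, and `'c'`. The name `df_labeled` is only for illustration.

```python
import pandas as pd

lst2 = [[1, 2, 100],
        [2, 4, 100],
        [3, 8, 100]]

# Each inner list becomes one row; labels are supplied for both axes.
df_labeled = pd.DataFrame(lst2, index=['a', 'b', 'c'], columns=['x', 'y', 'z'])
print(df_labeled)

# Look up a single cell by its row and column labels.
print(df_labeled.loc['b', 'y'])  # 4
```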


Let’s suppose your data is stored in a `NumPy` array.

In [27]:
arr = np.array([[1, 2, 100],
        [2, 4, 100],
        [3, 8, 100]])

df_ = pd.DataFrame(arr, columns=['x', 'y', 'z'])

In [28]:
df_

Unnamed: 0,x,y,z
0,1,2,100
1,2,4,100
2,3,8,100


In [29]:
arr[1,1] = 33


In [30]:
df_

Unnamed: 0,x,y,z
0,1,2,100
1,2,33,100
2,3,8,100


Because the data that we passed in is a `NumPy` array, the `DataFrame` shares its data with that array here. So when we change a value in the array, the corresponding entry in the `DataFrame` also changes to 33.

This may be something that you want, but if it isn’t, then you should pass in a value of `True` to the keyword argument `copy`.

So we’ll rerun this cell with `copy=True` and check that the `DataFrame` contains the data from the original `NumPy` array. If we then rerun the command that changes the value of the `[1, 1]` entry and look at the `DataFrame` again, it still has the original values that were in the array when we passed it in as the data.

In [33]:
arr = np.array([[1, 2, 100],
        [2, 4, 100],
        [3, 8, 100]])

df_ = pd.DataFrame(arr, columns=['x', 'y', 'z'], copy=True)

In [34]:
df_

Unnamed: 0,x,y,z
0,1,2,100
1,2,4,100
2,3,8,100
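
The steps just described can be sketched in a single cell. The name `df_copy` is only for illustration. With `copy=True`, the change to the array should not reach the `DataFrame`.

```python
import numpy as np
import pandas as pd

arr = np.array([[1, 2, 100],
                [2, 4, 100],
                [3, 8, 100]])

# copy=True gives the DataFrame its own copy of the data.
df_copy = pd.DataFrame(arr, columns=['x', 'y', 'z'], copy=True)

# Change the array after the DataFrame has been built.
arr[1, 1] = 33

print(arr[1, 1])            # 33
print(df_copy.loc[1, 'y'])  # 4
```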


## Creating DataFrames From CSV Files

In [35]:
data = {
    'name': ['Xavier', 'Ann', 'Jana', 'Yi', 'Robin', 'Amal', 'Nori'],
    'city': ['Mexico City', 'Toronto', 'Prague', 'Shanghai',
             'Manchester', 'Cairo', 'Osaka'],
    'age': [41, 28, 33, 34, 38, 31, 37],
    'py-score': [88.0, 79.0, 81.0, 80.0, 68.0, 61.0, 84.0]
}

index = range(101, 108)

In [36]:
df = pd.DataFrame(data, index=index)

In [38]:
df.to_csv('job_candidates.csv')

Pandas writes a first line that contains the column labels, followed by one line per row that starts with the row label and then holds all of the data for that row.
Note that the first column has no label in the CSV file. It contains the row labels, or the index, of the `DataFrame` that we used to create the file.
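
To look at that layout without opening the file, one option is to call `.to_csv()` with no path, which returns the CSV text as a string. The sketch below uses a shortened, two-row version of the data, so the variable names and values are illustrative only.

```python
import pandas as pd

# A shortened stand-in for the job candidates data.
df_small = pd.DataFrame({'name': ['Xavier', 'Ann'], 'age': [41, 28]},
                        index=[101, 102])

# With no path argument, to_csv() returns the CSV text instead of writing a file.
csv_text = df_small.to_csv()
lines = csv_text.splitlines()

print(lines[0])  # ,name,age
print(lines[1])  # 101,Xavier,41
```

The header line starts with a comma because the index column has no label.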



In [41]:
pd.read_csv('job_candidates.csv')

Unnamed: 0.1,Unnamed: 0,name,city,age,py-score
0,101,Xavier,Mexico City,41,88.0
1,102,Ann,Toronto,28,79.0
2,103,Jana,Prague,33,81.0
3,104,Yi,Shanghai,34,80.0
4,105,Robin,Manchester,38,68.0
5,106,Amal,Cairo,31,61.0
6,107,Nori,Osaka,37,84.0


If we don’t pass in the `index_col` keyword argument, the `read_csv()` function treats that first column like any other column. It uses default integers for the index labels, and because the first column has no label in the file, it names that column `Unnamed: 0`.

Of course, we do want that first column to serve as the index of the `DataFrame` that we construct from the CSV file, so we pass in `index_col=0`:

In [42]:
pd.read_csv('job_candidates.csv', index_col=0)

Unnamed: 0,name,city,age,py-score
101,Xavier,Mexico City,41,88.0
102,Ann,Toronto,28,79.0
103,Jana,Prague,33,81.0
104,Yi,Shanghai,34,80.0
105,Robin,Manchester,38,68.0
106,Amal,Cairo,31,61.0
107,Nori,Osaka,37,84.0
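
The two ways of reading the file can be compared side by side. This sketch reads from an in-memory string through `io.StringIO` instead of `job_candidates.csv`, and the two-row `csv_text` is a made-up stand-in for the real file.

```python
import io
import pandas as pd

# CSV text with an unlabeled first column, like the file written by to_csv().
csv_text = ",name,age\n101,Xavier,41\n102,Ann,28\n"

# Without index_col, the first column is read as ordinary data.
plain = pd.read_csv(io.StringIO(csv_text))
print(plain.columns.tolist())  # ['Unnamed: 0', 'name', 'age']
print(plain.index.tolist())    # [0, 1]

# With index_col=0, the first column becomes the row labels.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(indexed.columns.tolist())  # ['name', 'age']
print(indexed.index.tolist())    # [101, 102]
```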


## Understanding DataFrame Attributes

Two important attributes of a `DataFrame` are the `.index` and `.columns` attributes.



In [43]:
df.index

RangeIndex(start=101, stop=108, step=1)

This is a sequence-type object, so I can access individual elements of this `RangeIndex` using list notation. For example, the index position `1` gives the second element, `102`:

In [45]:
df.index[1]

102

In [46]:
df.columns

Index(['name', 'city', 'age', 'py-score'], dtype='object')

`.columns` returns an `Index` object, which is also a sequence, so we can access individual elements using regular list notation. For example, position `2` gives the third element, `'age'`. Both the `.index` and `.columns` attributes return `Index` objects, and `Index` objects are immutable, so you can’t modify their elements in place.

In [48]:
df.columns[2]

'age'

Although individual elements can’t be changed, I can replace the entire index. For example, to relabel the rows from 10 through 16, I can assign a NumPy `arange` that starts at 10 and stops before 17:

In [50]:
df.index = np.arange(10, 17)

In [51]:
df.index

Int64Index([10, 11, 12, 13, 14, 15, 16], dtype='int64')
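
The difference between changing one label and replacing all of them can be sketched as follows. The small `df_demo` frame and the new labels are made up for illustration.

```python
import pandas as pd

df_demo = pd.DataFrame({'name': ['Xavier', 'Ann'], 'age': [41, 28]},
                       index=[101, 102])

# Assigning to a single element of an Index raises a TypeError.
item_assignment_failed = False
try:
    df_demo.columns[1] = 'years'
except TypeError as exc:
    item_assignment_failed = True
    print(f"TypeError: {exc}")

# Replacing the whole Index object is allowed.
df_demo.columns = ['name', 'years']
df_demo.index = [10, 11]
print(df_demo)
```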

To access the values, you use the `.values` attribute.



In [52]:
df.values

array([['Xavier', 'Mexico City', 41, 88.0],
       ['Ann', 'Toronto', 28, 79.0],
       ['Jana', 'Prague', 33, 81.0],
       ['Yi', 'Shanghai', 34, 80.0],
       ['Robin', 'Manchester', 38, 68.0],
       ['Amal', 'Cairo', 31, 61.0],
       ['Nori', 'Osaka', 37, 84.0]], dtype=object)

This returns a two-dimensional `NumPy` array where each row of the array is a row of the `DataFrame`. There’s also a method called `.to_numpy()`, which does the same thing. The pandas documentation recommends using `.to_numpy()` instead because it offers more flexibility through keyword arguments such as `dtype` and `copy`.

In [53]:
df.to_numpy()

array([['Xavier', 'Mexico City', 41, 88.0],
       ['Ann', 'Toronto', 28, 79.0],
       ['Jana', 'Prague', 33, 81.0],
       ['Yi', 'Shanghai', 34, 80.0],
       ['Robin', 'Manchester', 38, 68.0],
       ['Amal', 'Cairo', 31, 61.0],
       ['Nori', 'Osaka', 37, 84.0]], dtype=object)
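
As a small sketch of that extra flexibility, the `dtype` keyword of `.to_numpy()` can request a specific type for the resulting array. The three-row `df_scores` frame below is a shortened stand-in for the data above.

```python
import numpy as np
import pandas as pd

df_scores = pd.DataFrame({'name': ['Xavier', 'Ann', 'Jana'],
                          'age': [41, 28, 33],
                          'py-score': [88.0, 79.0, 81.0]})

# Mixed text and numbers fall back to a generic object array.
mixed = df_scores.to_numpy()
print(mixed.dtype)

# Selecting only the numeric columns and asking for float64 gives a numeric array.
numeric = df_scores[['age', 'py-score']].to_numpy(dtype=np.float64)
print(numeric.dtype)
print(numeric.shape)  # (3, 2)
```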

Another important attribute of the DataFrame is the `.dtypes` attribute:

In [54]:
df.dtypes

name         object
city         object
age           int64
py-score    float64
dtype: object

If you want to change the data types, you can use the `.astype()` method on a `DataFrame`, which returns a new `DataFrame`. Let's change the `age` column to have a `NumPy` `int32` data type and the `py-score` column to have a `NumPy` `float32` data type.

In [57]:
df_ = df.astype(dtype={'age': np.int32, 'py-score': np.float32})

In [58]:
df_.dtypes

name         object
city         object
age           int32
py-score    float32
dtype: object

There are also attributes that give the dimensions and the size of a pandas `DataFrame`. They’re similar to the `NumPy` array attributes of the same names: `.ndim`, `.size`, and `.shape`.

In [59]:
df.ndim

2

In [60]:
df.size

28

In [61]:
df.shape

(7, 4)
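
These three attributes are related: `.size` is the number of rows times the number of columns given by `.shape`. A minimal check, using a made-up three-row frame called `df_dims`:

```python
import pandas as pd

df_dims = pd.DataFrame({'age': [41, 28, 33],
                        'py-score': [88.0, 79.0, 81.0]})

n_rows, n_cols = df_dims.shape
print(df_dims.ndim)     # 2
print(df_dims.shape)    # (3, 2)
print(df_dims.size)     # 6
print(n_rows * n_cols)  # 6
print(len(df_dims))     # 3, the number of rows
```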

Finally, you might find the `.memory_usage()` method useful. It reports the amount of memory, in bytes, used by the index and by each column of your `DataFrame`:

In [62]:
df_.memory_usage()

Index       56
name        56
city        56
age         28
py-score    28
dtype: int64
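
For columns that hold text, the default report may not include the memory taken by the strings themselves. The sketch below uses the `deep` and `index` keyword arguments of `.memory_usage()` on a made-up three-row frame. The exact byte counts depend on your platform and pandas version, so none are shown.

```python
import pandas as pd

df_mem = pd.DataFrame({'name': ['Xavier', 'Ann', 'Jana'],
                       'age': [41, 28, 33]})

# Default report: one entry for the index and one per column.
shallow = df_mem.memory_usage()
print(shallow)

# deep=True also inspects the objects stored in each column.
deep = df_mem.memory_usage(deep=True)
print(deep)

# index=False leaves the index out, and .sum() gives a single total in bytes.
total_bytes = int(df_mem.memory_usage(index=False, deep=True).sum())
print(total_bytes)
```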