# Primary data structures

Pandas has two primary data structures: `Series` and `DataFrame`.

1. `Series`: A series is a one-dimensional labeled array that can hold any data type. It's similar to a column in a spreadsheet or a one-dimensional NumPy array. Each element in a series has an associated label, and together these labels form the index. The index makes data manipulation more intuitive because you can reference specific elements of your data by label.
2. `DataFrame`: A dataframe is a two-dimensional labeled data structure, essentially a table or spreadsheet in which each column is a series and each row can also be retrieved as a series.
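
As a minimal sketch of the two structures described above (the names `s` and `frame` and the sample values are illustrative and not part of the original notebook):

```python
import pandas as pd

# A Series: one-dimensional values plus an index of labels
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print(s['b'])                 # look up an element by its label

# A DataFrame: a table whose columns are Series sharing one index
frame = pd.DataFrame({'x': [1, 2, 3], 'y': [4.0, 5.0, 6.0]},
                     index=['a', 'b', 'c'])
col = frame['x']              # selecting one column returns a Series
print(type(col).__name__)
```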

# Create a DataFrame

First, import pandas and NumPy.

In [4]:
import pandas as pd
import numpy as np

In [2]:
# Create from a dictionary
d = {'col1': [1,2], 'col2': [3,4]}
df = pd.DataFrame(d)
df

   col1  col2
0     1     3
1     2     4


In [5]:
# Create from a numpy array

df2 = pd.DataFrame(np.array([[1,2,3],[4,5,6],[7,8,9]]), columns=['a','b','c'])
df2

   a  b  c
0  1  2  3
1  4  5  6
2  7  8  9


In [None]:
# Create from a file (csv)
df3 = pd.read_csv('file_path')
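
Because `'file_path'` above is only a placeholder, here is a small self-contained sketch of reading CSV data. It assumes made-up CSV text and uses `io.StringIO` to stand in for a file on disk; the name `from_csv` is hypothetical.

```python
import io
import pandas as pd

# Hypothetical CSV content; io.StringIO stands in for a real file
csv_text = "col1,col2\n1,3\n2,4\n"

# read_csv is a top-level pandas function, not a DataFrame method
from_csv = pd.read_csv(io.StringIO(csv_text))
print(from_csv)
```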

# Attributes and methods

The DataFrame class is powerful and convenient because it comes with a suite of built-in features that simplify common data analysis tasks.

These features are known as attributes and methods.

1. An attribute is a value associated with an object or class that is referenced by name using dotted expressions.
2. A method is a function that is defined inside a class body and typically performs an action.

A simpler way of thinking about the distinction between attributes and methods is to remember that attributes are characteristics of the object, while methods are actions or operations.
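
The distinction can be seen in a short sketch (the dataframe `example` is illustrative): an attribute is read without parentheses, while a method is called with them.

```python
import pandas as pd

example = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})

# Attribute: a characteristic of the object, accessed without parentheses
print(example.shape)

# Method: an action, called with parentheses and optional arguments
first_row = example.head(1)
print(first_row)
```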

## Common DataFrame attributes

Data professionals use attributes and methods constantly. Some of the most-used DataFrame attributes include:

|Attribute|Description|
|-----------|-------------|
|columns|Returns the column labels of the dataframe|
|dtypes|Returns the data types in the dataframe|
|iloc|Accesses a group of rows and columns using integer-based indexing|
|loc|Accesses a group of rows and columns by label(s) or a Boolean array|
|shape|Returns a tuple representing the dimensionality of the dataframe|
|values|Returns a NumPy representation of the dataframe|
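
A brief sketch of these attributes in use, assuming a small made-up dataframe named `demo`:

```python
import pandas as pd

demo = pd.DataFrame({'name': ['x', 'y', 'z'], 'score': [1.5, 2.5, 3.5]})

print(demo.columns)            # column labels
print(demo.dtypes)             # one data type per column
print(demo.shape)              # (number of rows, number of columns)
print(demo.values)             # underlying data as a NumPy array
print(demo.iloc[0, 1])         # first row, second column, by position
print(demo.loc[0, 'score'])    # the same cell, by label
```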

## Common DataFrame methods

Some of the most-used DataFrame methods include:

|Method|Description|
|------|-----------|
|apply()|Applies a function over an axis of the dataframe|
|copy()|Makes a copy of the dataframe’s indices and data|
|describe()|Returns descriptive statistics of the dataframe, including the count, mean, standard deviation, minimum, maximum, and percentile values of its numeric columns|
|drop()|Drops specified labels from rows or columns|
|groupby()|Splits the dataframe, applies a function, and combines the results|
|head(n=5)|Returns the first n rows of the dataframe (default=5)|
|info()|Prints a concise summary of the dataframe, including column data types and non-null counts|
|isna()|Returns a same-sized Boolean dataframe indicating whether each value is null (can also use isnull() as an alias)|
|sort_values()|Sorts by the values across a given axis|
|value_counts()|Returns a series containing counts of unique rows in the dataframe|
|where()|Replaces values in the dataframe where a given condition is false|
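
The following sketch exercises a few of these methods on hypothetical sales data (the dataframe `sales` and its values are invented for illustration):

```python
import pandas as pd

# Hypothetical sales data with one missing value
sales = pd.DataFrame({
    'region': ['east', 'west', 'east', 'west'],
    'units': [10, 20, 30, None],
})

missing = sales.isna().sum()                                # missing values per column
summary = sales.describe()                                  # statistics for numeric columns
top = sales.sort_values('units', ascending=False).head(2)   # two largest rows
by_region = sales.groupby('region')['units'].sum()          # split, sum, combine
units = sales[['units']]
filled = units.where(units > 15, other=0)                   # values failing the test become 0

print(missing, summary, top, by_region, filled, sep='\n\n')
```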

# Selection statements

## Row Selection
Rows of a dataframe are selected by their index. The index can be referenced either by name or by numeric position.

### `loc[]`

`loc[]` lets you select rows by name.

In [7]:
df = pd.DataFrame({
    'A': ['alpha', 'apple', 'arsenic', 'angel', 'android'],
    'B': [1, 2, 3, 4, 5],
    'C': ['coconut', 'curse', 'cassava', 'cuckoo', 'clarinet'],
    'D': [6, 7, 8, 9, 10]
    },index=['row_0', 'row_1', 'row_2', 'row_3', 'row_4'])
df

             A  B         C   D
row_0    alpha  1   coconut   6
row_1    apple  2     curse   7
row_2  arsenic  3   cassava   8
row_3    angel  4    cuckoo   9
row_4  android  5  clarinet  10


In [8]:
# The row index of the dataframe contains the names of the rows. Use loc[] to select rows by name
print(df.loc['row_1'])

A    apple
B        2
C    curse
D        7
Name: row_1, dtype: object


In [9]:
# Passing the row label inside a list returns a DataFrame object
print(df.loc[['row_1']])

           A  B      C  D
row_1  apple  2  curse  7


In [10]:
# To select multiple rows by name, use a list within selector brackets
print(df.loc[['row_1','row_3']])

           A  B       C  D
row_1  apple  2   curse  7
row_3  angel  4  cuckoo  9


In [11]:
# Specify a range of rows by named index (the end label is included)
print(df.loc['row_1':'row_3'])

             A  B        C  D
row_1    apple  2    curse  7
row_2  arsenic  3  cassava  8
row_3    angel  4   cuckoo  9


### `iloc[]`

`iloc[]` lets you select rows by numeric position, similar to how you would access elements of a list or an array. Here’s an example.

In [12]:
print(df)
print()
print(df.iloc[1])

             A  B         C   D
row_0    alpha  1   coconut   6
row_1    apple  2     curse   7
row_2  arsenic  3   cassava   8
row_3    angel  4    cuckoo   9
row_4  android  5  clarinet  10

A    apple
B        2
C    curse
D        7
Name: row_1, dtype: object


In [13]:
# Passing the row position inside a list returns a DataFrame object
print(df.iloc[[1]])

           A  B      C  D
row_1  apple  2  curse  7


In [14]:
# To select multiple rows by index number, use a list within selector brackets
print(df.iloc[[0,2,4]])

             A  B         C   D
row_0    alpha  1   coconut   6
row_2  arsenic  3   cassava   8
row_4  android  5  clarinet  10


In [15]:
# Specify a range of rows by index number
print(df.iloc[0:3])

             A  B        C  D
row_0    alpha  1  coconut  6
row_1    apple  2    curse  7
row_2  arsenic  3  cassava  8
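
One difference between the two selectors is easy to miss: a `loc[]` slice includes its end label, while an `iloc[]` slice excludes its end position, as Python lists do. A minimal sketch, using a reduced illustrative dataframe named `small`:

```python
import pandas as pd

small = pd.DataFrame({'B': [1, 2, 3, 4, 5]},
                     index=['row_0', 'row_1', 'row_2', 'row_3', 'row_4'])

by_label = small.loc['row_1':'row_3']   # label slice: 'row_3' is included
by_position = small.iloc[1:3]           # position slice: position 3 is excluded

print(list(by_label.index))
print(list(by_position.index))
```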


## Column selection

Column selection works much like row selection, but there are also some shortcuts to make the process easier.

For example, to select an individual column, put its name in selector brackets after the name of the dataframe:

In [17]:
print(df['C'])

row_0     coconut
row_1       curse
row_2     cassava
row_3      cuckoo
row_4    clarinet
Name: C, dtype: object


In [18]:
# To select multiple columns, use a list in selector brackets
print(df[['A','C']])

             A         C
row_0    alpha   coconut
row_1    apple     curse
row_2  arsenic   cassava
row_3    angel    cuckoo
row_4  android  clarinet


In [19]:
# Or you can use dot notation
print(df.A)

row_0      alpha
row_1      apple
row_2    arsenic
row_3      angel
row_4    android
Name: A, dtype: object


Dot notation is often convenient and easier to type. However, it can make your code more difficult to read, especially in longer statements involving method chaining or condition-based selection.
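
Beyond readability, dot notation has practical limits. The sketch below uses a made-up dataframe named `tricky` to show two of them: a column whose name matches a DataFrame method, and a column whose name contains a space.

```python
import pandas as pd

tricky = pd.DataFrame({'count': [1, 2], 'unit price': [9.5, 3.0]})

# 'count' is also a DataFrame method, so dot notation returns the method,
# not the column
print(callable(tricky.count))

# Bracket notation always returns the column
print(tricky['count'].tolist())

# A name containing a space can only be reached with brackets
print(tricky['unit price'].tolist())
```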

### `loc[]`

You can also use `loc[]` notation.

Note that when using `loc[]` to select columns, you must specify rows as well. In the following example, all rows are selected using just a colon (`:`).

In [20]:
print(df)
print()

print(df.loc[:, ['B', 'D']])

             A  B         C   D
row_0    alpha  1   coconut   6
row_1    apple  2     curse   7
row_2  arsenic  3   cassava   8
row_3    angel  4    cuckoo   9
row_4  android  5  clarinet  10

       B   D
row_0  1   6
row_1  2   7
row_2  3   8
row_3  4   9
row_4  5  10


### `iloc[]`

Similarly, you can use `iloc[]` notation. Again, when using `iloc[]` to select columns, you must specify rows, even if you want to select all rows.

In [21]:
print(df.iloc[:, [1,3]])

       B   D
row_0  1   6
row_1  2   7
row_2  3   8
row_3  4   9
row_4  5  10


## Select rows and columns

Both `loc[]` and `iloc[]` can be used to select specific rows and columns together.

### `loc[]`

When using `loc[]` to select a range, the final element in the range is included in the results.

In [22]:
print(df.loc['row_0':'row_2', ['A','C']])

             A        C
row_0    alpha  coconut
row_1    apple    curse
row_2  arsenic  cassava


In [23]:
print(df.loc['row_0':'row_2', 'A':'C'])

             A  B        C
row_0    alpha  1  coconut
row_1    apple  2    curse
row_2  arsenic  3  cassava
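
As the attribute table notes, `loc[]` also accepts a Boolean array. A short sketch of combining a Boolean row condition with a column list, assuming a reduced illustrative dataframe named `fruit`:

```python
import pandas as pd

fruit = pd.DataFrame(
    {'A': ['alpha', 'apple', 'arsenic'],
     'B': [1, 2, 3],
     'C': ['coconut', 'curse', 'cassava']},
    index=['row_0', 'row_1', 'row_2'])

mask = fruit['B'] >= 2                 # Boolean Series, one value per row
subset = fruit.loc[mask, ['A', 'C']]   # rows where mask is True, columns A and C
print(subset)
```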


### `iloc[]`

`iloc[]` accepts only integer positions. Even when the rows have named indices, you cannot mix numeric and named notation in a single `iloc[]` call.

In [24]:
print(df.iloc[[2, 4], 0:3])

             A  B         C
row_2  arsenic  3   cassava
row_4  android  5  clarinet


To view rows [0:3] at column ‘D’ when you don’t know the index number of column D, one option is to use selector brackets after an `iloc[]` statement:

In [25]:
# Chained selection is convenient for viewing:
print(df.iloc[0:3][['D']])

# A single loc[] call is more reliable for assignment, because chained
# indexing may operate on a copy instead of the original dataframe:
print(df.loc[df.index[0:3], 'D'])

       D
row_0  6
row_1  7
row_2  8
row_0    6
row_1    7
row_2    8
Name: D, dtype: int64
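
Another option, sketched here under the assumption that column labels are unique, is to look up the column's position with `columns.get_loc()` and pass it to `iloc[]`. The dataframe `letters` is a reduced illustrative copy of the example data.

```python
import pandas as pd

letters = pd.DataFrame({'A': ['alpha', 'apple', 'arsenic', 'angel'],
                        'D': [6, 7, 8, 9]},
                       index=['row_0', 'row_1', 'row_2', 'row_3'])

d_pos = letters.columns.get_loc('D')     # integer position of column 'D'
first_three = letters.iloc[0:3, d_pos]   # purely positional selection
print(first_three)
```

This keeps the selection in a single indexing call, which avoids the chained-indexing concern when assigning values.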
