<h1>Intro to pandas</h1>

<p>When working with tabular data, such as data stored in spreadsheets or databases, pandas is the right tool for you. pandas will help you explore, clean, and process your data. In pandas, a data table is called a DataFrame.</p>

![image.png](attachment:image.png)

<h6>A DataFrame is a 2-dimensional data structure that can store data of different types (including characters, integers, floating point values, categorical data and more) in columns. It is similar to a spreadsheet, a SQL table, or the data.frame in R.</h6>

<p><span style='color:green;font-size:20px'>Load</span><span style='color:blue;font-size:20px'> Pandas</span></p>
<p>To load the pandas package and start working with it, import the package. The community-agreed alias for pandas is <code>pd</code>, so importing pandas as <code>pd</code> is standard practice throughout the pandas documentation.</p>

In [3]:
import pandas as pd

<p><span style="font-weight:bold;font-size:20px;color:#100c08">  Creating a</span> <span style='color:#a4c639;font-size:20px;font-weight:bold'>DataFrame.</span></p>

In [4]:
# Mnemonic name for DataFrame df 

df=pd.DataFrame({    
    'Name':['Simran','Rohit','Raghu','Saachi','Kavya'],
    "Age":[i for i in range(30,80,10)],
    "Experience":[i for i in range(5,18,3)]})
df

Unnamed: 0,Name,Age,Experience
0,Simran,30,5
1,Rohit,40,8
2,Raghu,50,11
3,Saachi,60,14
4,Kavya,70,17


<p>Notice that the inferred dtypes are <code>int64</code> for the numeric columns and <code>object</code> for the string column.</p>

In [5]:
df.dtypes

Name          object
Age            int64
Experience     int64
dtype: object

To enforce a single dtype:

In [6]:
import numpy as np

df=pd.DataFrame({    
    'Name':['Simran','Rohit','Raghu','Saachi','Kavya'],
    "Age":[i for i in range(30,80,10)],
    "Experience":[i for i in range(5,18,3)]},dtype=str)  # plain str; the np.str0 alias was removed in NumPy 2.0


df.dtypes

Name          object
Age           object
Experience    object
dtype: object
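<p>The <code>dtype</code> argument of the constructor applies one dtype to every column. When only some columns need converting, one option is <code>astype</code> with a column-to-dtype mapping. The sketch below is illustrative only; the variable name <code>typed</code> and the chosen target dtypes are assumptions, not part of the original notebook.</p>

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Simran", "Rohit", "Raghu", "Saachi", "Kavya"],
    "Age": list(range(30, 80, 10)),
    "Experience": list(range(5, 18, 3)),
})

# astype accepts a {column: dtype} mapping; columns that are not
# listed (here "Name") keep their inferred dtype.
typed = df.astype({"Age": "float64", "Experience": "int32"})

print(typed.dtypes)
```

<p><code>astype</code> returns a new DataFrame, so <code>df</code> itself is left unchanged.</p>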

Constructing DataFrame from a dictionary including Series:

In [7]:
df=pd.DataFrame({    
    'Name':['Simran','Rohit','Raghu','Saachi','Kavya'],
    "Age":[i for i in range(30,80,10)],
    "Experience":pd.Series([i for i in range(5,18,3)])})

df


Unnamed: 0,Name,Age,Experience
0,Simran,30,5
1,Rohit,40,8
2,Raghu,50,11
3,Saachi,60,14
4,Kavya,70,17
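<p>When the dictionary values are Series, pandas aligns them on their index labels rather than by position. A minimal sketch, using two hypothetical Series <code>a</code> and <code>b</code> with partially overlapping labels:</p>

```python
import pandas as pd

a = pd.Series([1, 2], index=["x", "y"])
b = pd.Series([3, 4], index=["y", "z"])

# The resulting index is the union of both Series indexes;
# positions with no matching label are filled with NaN.
aligned = pd.DataFrame({"a": a, "b": b})

print(aligned)
```

<p>Because missing entries become NaN, the integer columns are upcast to a floating point dtype in this case.</p>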


Constructing DataFrame from numpy ndarray:

In [11]:
pip install random-word

Collecting random-word
  Downloading Random_Word-1.0.11-py3-none-any.whl (1.2 MB)
Installing collected packages: random-word
Successfully installed random-word-1.0.11
Note: you may need to restart the kernel to use updated packages.


In [12]:
# Generate random column names
from random_word import RandomWords



In [13]:
r = RandomWords()

labels=[r.get_random_word() for i in range(10)]
arr=np.random.randint(1,1000,100).reshape(10,10)
df=pd.DataFrame(arr,columns=labels)

df

Unnamed: 0,clypeola,afflictively,quadrisyllabic,oscinian,eudaemonistic,proagricultural,missingly,becuiba,oxbane,undesirableness
0,508,260,49,222,963,615,659,589,846,733
1,961,345,355,726,991,838,858,255,41,620
2,577,155,625,779,892,307,117,32,312,773
3,840,261,181,558,350,839,506,99,59,611
4,469,3,912,907,970,733,146,685,599,135
5,707,400,236,248,79,133,490,519,969,864
6,40,199,147,458,102,306,373,940,759,731
7,748,3,749,41,516,180,138,429,661,385
8,283,750,197,98,606,438,753,535,980,994
9,912,26,701,420,358,582,150,244,164,816
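<p>The random column names and values above change on every run. If a reproducible table is preferred, a seeded NumPy generator and generated labels can be used instead. This is only a sketch: the names <code>rng</code>, <code>demo</code>, and the <code>col_0</code>-style labels are assumptions made for illustration.</p>

```python
import numpy as np
import pandas as pd

# A seeded generator gives the same numbers on every run.
rng = np.random.default_rng(seed=0)
arr = rng.integers(1, 1000, size=(10, 10))  # integers in [1, 1000)

# One label per column of the array.
labels = [f"col_{i}" for i in range(arr.shape[1])]
demo = pd.DataFrame(arr, columns=labels)

print(demo.head())
```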


<p style="font-size:25px;color:#f2f3f4"><mark>Attributes</mark></p>

<p style=";font-size:20px">𝓓𝓪𝓽𝓪𝓕𝓻𝓪𝓶𝓮.<span style="color:	#008000;font-size:20px">𝓪𝓽</span></p>
<p>Access a single value for a row/column label pair.</p>
<p>Similar to <code><span>loc</span></code>, in that both provide label-based lookups. Use
<code ><span >at</span></code> if you only need to get or set a single value in a DataFrame
or Series.</p>

In [17]:
df.at[0,'clypeola']

508
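<p><code>at</code> can also set a single value. The sketch below uses a small hypothetical DataFrame named <code>scores</code> with string row labels, so the result does not depend on the random table above.</p>

```python
import pandas as pd

scores = pd.DataFrame(
    {"math": [90, 80], "art": [70, 60]},
    index=["ana", "ben"],
)

# Get a single value by row label and column label.
value = scores.at["ana", "math"]

# Set a single value in place.
scores.at["ben", "art"] = 65

# at takes exactly one row label and one column label;
# selecting several rows at once is a job for loc.
subset = scores.loc[["ana", "ben"], "art"]

print(value)
print(subset)
```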

<p style=";font-size:20px">𝓓𝓪𝓽𝓪𝓕𝓻𝓪𝓶𝓮.<span style="color:	#008000;font-size:20px">𝓪𝔁𝓮𝓼</span></p>
<p>Return a list representing the axes of the DataFrame.</p>
<p>It has the row axis labels and column axis labels as the only members.
They are returned in that order.</p>

In [18]:
df.axes

[RangeIndex(start=0, stop=10, step=1),
 Index(['clypeola', 'afflictively', 'quadrisyllabic', 'oscinian',
        'eudaemonistic', 'proagricultural', 'missingly', 'becuiba', 'oxbane',
        'undesirableness'],
       dtype='object')]

<p style=";font-size:20px">𝓓𝓪𝓽𝓪𝓕𝓻𝓪𝓶𝓮.<span style="color:	#008000;font-size:20px">𝓬𝓸𝓵𝓾𝓶𝓷𝓼</span></p>
<p>The column labels of the DataFrame.</p>

In [19]:
df.columns

Index(['clypeola', 'afflictively', 'quadrisyllabic', 'oscinian',
       'eudaemonistic', 'proagricultural', 'missingly', 'becuiba', 'oxbane',
       'undesirableness'],
      dtype='object')

<p style=";font-size:20px">𝓓𝓪𝓽𝓪𝓕𝓻𝓪𝓶𝓮.<span style="color:	#008000;font-size:20px">𝓭𝓽𝔂𝓹𝓮𝓼</span></p>
<p>Return the dtypes in the DataFrame.</p>
<p>This returns a Series with the data type of each column.
The result’s index is the original DataFrame’s columns. Columns with mixed types are stored with the <code><span>object</span></code> dtype.</p>

In [32]:
df.dtypes

clypeola           int32
afflictively       int32
quadrisyllabic     int32
oscinian           int32
eudaemonistic      int32
proagricultural    int32
missingly          int32
becuiba            int32
oxbane             int32
undesirableness    int32
dtype: object
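<p>To illustrate the note about mixed types, the following sketch builds a hypothetical DataFrame named <code>mixed_df</code> in which one column holds an integer, a string, and a float. <code>select_dtypes</code> is then used to keep only the numeric columns.</p>

```python
import pandas as pd

mixed_df = pd.DataFrame({
    "mixed": [1, "a", 2.5],   # int, str, and float in one column
    "ints": [1, 2, 3],
})

# The mixed column falls back to the generic object dtype.
print(mixed_df.dtypes)

# Keep only columns with a numeric dtype.
numeric_only = mixed_df.select_dtypes(include="number")
print(list(numeric_only.columns))
```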

<p style=";font-size:20px">𝓓𝓪𝓽𝓪𝓕𝓻𝓪𝓶𝓮.<span style="color:#008000;font-size:20px">𝓮𝓶𝓹𝓽𝔂</span></p><hr>
<p>Indicates whether the Series/DataFrame is empty, that is, whether any of its axes has length 0.</p>

In [21]:
df.empty

False
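<p>A DataFrame that contains only missing values is not considered empty, because it still has rows. The sketch below shows the difference with two hypothetical frames, <code>no_rows</code> and <code>only_nan</code>.</p>

```python
import numpy as np
import pandas as pd

no_rows = pd.DataFrame({"A": []})
only_nan = pd.DataFrame({"A": [np.nan]})

print(no_rows.empty)            # True: the row axis has length 0
print(only_nan.empty)           # False: one row exists, even though it is NaN
print(only_nan.dropna().empty)  # True: dropping the NaN row leaves no rows
```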

<p style=";font-size:20px">𝓓𝓪𝓽𝓪𝓕𝓻𝓪𝓶𝓮.<span style="color:#008000;font-size:20px">𝓲𝓪𝓽</span></p>
<p>Access a single value for a row/column pair by integer position.</p>
<p>Similar to <code class="docutils literal notranslate"><span class="pre">iloc</span></code>, in that both provide integer-based lookups. Use
<code class="docutils literal notranslate"><span class="pre">iat</span></code> if you only need to get or set a single value in a DataFrame
or Series.</p>

In [22]:
df.iat[0,0]

508

<p style=";font-size:20px">𝓓𝓪𝓽𝓪𝓕𝓻𝓪𝓶𝓮.<span style="color:#008000;font-size:20px">𝓲𝓵𝓸𝓬</span></p>
<p>Purely integer-location based indexing for selection by position.</p>
<p><code><span>.iloc[]</span></code> is primarily integer position based (from <code><span>0</span></code> to
<code><span>length-1</span></code> of the axis), but may also be used with a boolean array.</p>

<p>Allowed inputs are:</p>
<ul>
<li><p>An integer, e.g. <code><span>5</span></code>.</p></li>
<li><p>A list or array of integers, e.g. <code><span>[4,</span> <span>3,</span> <span>0]</span></code>.</p></li>
<li><p>A slice object with ints, e.g. <code><span>1:7</span></code>.</p></li>
<li><p>A boolean array.</p></li>
<li><p>A <code><span>callable</span></code> function with one argument (the calling Series or
DataFrame) and that returns valid output for indexing (one of the above).
This is useful in method chains, when you don’t have a reference to the
calling object, but would like to base your selection on some value.</p></li>
</ul>

<p><code><span>.iloc</span></code> will raise <code><span>IndexError</span></code> if a requested indexer is
out-of-bounds, except <em>slice</em> indexers which allow out-of-bounds indexing (this conforms with Python/NumPy <em>slice</em> semantics).</p>

In [23]:
df.iloc[0:,0:]

Unnamed: 0,clypeola,afflictively,quadrisyllabic,oscinian,eudaemonistic,proagricultural,missingly,becuiba,oxbane,undesirableness
0,508,260,49,222,963,615,659,589,846,733
1,961,345,355,726,991,838,858,255,41,620
2,577,155,625,779,892,307,117,32,312,773
3,840,261,181,558,350,839,506,99,59,611
4,469,3,912,907,970,733,146,685,599,135
5,707,400,236,248,79,133,490,519,969,864
6,40,199,147,458,102,306,373,940,759,731
7,748,3,749,41,516,180,138,429,661,385
8,283,750,197,98,606,438,753,535,980,994
9,912,26,701,420,358,582,150,244,164,816
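<p><code>df.iloc[0:,0:]</code> selects every row and column, so it returns the whole table. The sketch below exercises each allowed input on a small hypothetical DataFrame named <code>grid</code>, and also shows the out-of-bounds behaviour described above.</p>

```python
import numpy as np
import pandas as pd

grid = pd.DataFrame(np.arange(12).reshape(4, 3), columns=["a", "b", "c"])

row = grid.iloc[1]                               # single integer: one row as a Series
rows = grid.iloc[[3, 0]]                         # list of integers, in the given order
block = grid.iloc[1:3, 0:2]                      # slices: stop position is excluded
masked = grid.iloc[[True, False, True, False]]   # boolean array, one flag per row
evens = grid.iloc[lambda d: d.index % 2 == 0]    # callable returning a boolean array

# Slices may run past the end without an error...
tail = grid.iloc[2:100]

# ...but a single out-of-bounds integer raises IndexError.
out_of_bounds_raised = False
try:
    grid.iloc[100]
except IndexError:
    out_of_bounds_raised = True

print(block)
print(out_of_bounds_raised)
```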


<p style=";font-size:20px">𝓓𝓪𝓽𝓪𝓕𝓻𝓪𝓶𝓮.<span style="color:#008000;font-size:20px">𝓲𝓷𝓭𝓮𝔁</span></p>
<p>The index (row labels) of the DataFrame.</p>

In [24]:
df.index

RangeIndex(start=0, stop=10, step=1)

<p style=";font-size:20px">𝓓𝓪𝓽𝓪𝓕𝓻𝓪𝓶𝓮.<span style="color:#008000;font-size:20px">𝓵𝓸𝓬</span></p>
<p>Access a group of rows and columns by label(s) or a boolean array.</p>
<p><code><span>.loc[]</span></code> is primarily label based, but may also be used with a boolean array.</p>

<p>Allowed inputs are:</p>
<ul>
<li><p>A single label, e.g. <code><span>5</span></code> or <code><span>'a'</span></code>, (note that <code><span>5</span></code> is
interpreted as a <em>label</em> of the index, and <strong>never</strong> as an
integer position along the index).</p></li>
<li><p>A list or array of labels, e.g. <code><span>['a',</span> <span>'b',</span> <span>'c']</span></code>.</p></li>
<li><p>A slice object with labels, e.g. <code><span>'a':'f'</span></code>.</p>
<div>
<p>Warning</p>
<p>Note that contrary to usual Python slices, <strong>both</strong> the
start and the stop are included.</p>
</div>
</li>
<li><p>A boolean array of the same length as the axis being sliced,
e.g. <code><span>[True,</span> <span>False,</span> <span>True]</span></code>.</p></li>
<li><p>An alignable boolean Series. The index of the key will be aligned before
masking.</p></li>
<li><p>An alignable Index. The Index of the returned selection will be the input.</p></li>
<li><p>A <code><span>callable</span></code> function with one argument (the calling Series or
DataFrame) and that returns valid output for indexing (one of the above).</p></li>
</ul>

In [25]:
df.loc[5]


clypeola           707
afflictively       400
quadrisyllabic     236
oscinian           248
eudaemonistic       79
proagricultural    133
missingly          490
becuiba            519
oxbane             969
undesirableness    864
Name: 5, dtype: int32
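<p>With the default <code>RangeIndex</code>, labels and positions coincide, which hides the difference between <code>loc</code> and <code>iloc</code>. The sketch below uses a hypothetical DataFrame named <code>people</code> with string labels to make the label-based behaviour, including the inclusive slice, easier to see.</p>

```python
import pandas as pd

people = pd.DataFrame(
    {"age": [30, 40, 50, 60], "city": ["Pune", "Delhi", "Agra", "Kochi"]},
    index=["a", "b", "c", "d"],
)

one = people.loc["b"]                              # single label: one row as a Series
several = people.loc[["a", "c"], ["age"]]          # lists of row and column labels
span = people.loc["a":"c"]                         # label slice: "c" is included
older = people.loc[people["age"] > 35, "city"]     # boolean Series as a row mask
via_callable = people.loc[lambda d: d["age"] < 45] # callable returning a mask

print(span)
print(older)
```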

<p style=";font-size:20px">𝓓𝓪𝓽𝓪𝓕𝓻𝓪𝓶𝓮.<span style="color:#008000;font-size:20px">𝓷𝓭𝓲𝓶</span></p>

<p>Return an int representing the number of axes / array dimensions.</p>
<p>Return 1 if Series. Otherwise return 2 if DataFrame.</p>

In [26]:

df.ndim


2

<p style=";font-size:20px">𝓓𝓪𝓽𝓪𝓕𝓻𝓪𝓶𝓮.<span style="color:#008000;font-size:20px">𝓼𝓱𝓪𝓹𝓮</span></p>
<p>Return a tuple representing the dimensionality of the DataFrame.</p>

In [28]:
df.shape

(10, 10)

<p style=";font-size:20px">𝓓𝓪𝓽𝓪𝓕𝓻𝓪𝓶𝓮.<span style="color:#008000;font-size:20px">𝓼𝓲𝔃𝓮</span></p>

<p>Return an int representing the number of elements in this object.</p>
<p>Return the number of rows if Series. Otherwise return the number of rows times number of columns if DataFrame.</p>

In [29]:
df.size

100
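<p>The three attributes <code>ndim</code>, <code>shape</code>, and <code>size</code> are related: <code>shape</code> has <code>ndim</code> entries, and <code>size</code> is their product. A short sketch on a hypothetical DataFrame named <code>frame</code> and one of its columns:</p>

```python
import pandas as pd

frame = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})

n_rows, n_cols = frame.shape            # (3, 2)
print(frame.ndim, frame.shape, frame.size)
print(frame.size == n_rows * n_cols)    # size is rows times columns

# A single column is a Series: one axis, so shape has one entry.
column = frame["x"]
print(column.ndim, column.shape, column.size)
```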

<p style=";font-size:20px">𝓓𝓪𝓽𝓪𝓕𝓻𝓪𝓶𝓮.<span style="color:#008000;font-size:20px">𝓿𝓪𝓵𝓾𝓮𝓼</span></p>
<p>Return a NumPy representation of the DataFrame.</p>

In [30]:
x=df.values
x


array([[508, 260,  49, 222, 963, 615, 659, 589, 846, 733],
       [961, 345, 355, 726, 991, 838, 858, 255,  41, 620],
       [577, 155, 625, 779, 892, 307, 117,  32, 312, 773],
       [840, 261, 181, 558, 350, 839, 506,  99,  59, 611],
       [469,   3, 912, 907, 970, 733, 146, 685, 599, 135],
       [707, 400, 236, 248,  79, 133, 490, 519, 969, 864],
       [ 40, 199, 147, 458, 102, 306, 373, 940, 759, 731],
       [748,   3, 749,  41, 516, 180, 138, 429, 661, 385],
       [283, 750, 197,  98, 606, 438, 753, 535, 980, 994],
       [912,  26, 701, 420, 358, 582, 150, 244, 164, 816]])
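<p>The pandas documentation recommends <code>DataFrame.to_numpy()</code> over <code>values</code>. In either case the result is a single array, so columns with different dtypes are converted to a common one. The sketch below uses hypothetical frames to show this.</p>

```python
import pandas as pd

mixed = pd.DataFrame({"n": [1, 2], "s": ["a", "b"]})
numbers = pd.DataFrame({"i": [1, 2], "f": [0.5, 1.5]})

# Integers and strings have no common numeric dtype,
# so the array falls back to object.
as_array = mixed.to_numpy()

# Integers and floats are upcast to float64.
floats = numbers.to_numpy()

print(as_array.dtype, floats.dtype)
```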

<H2><span style="color:blue">Read</span> and <span style="color:green">Write</span> Tabular Data</H2>
<p>pandas supports integration with many file formats and data sources out of the box (csv, excel, sql, json, parquet,…). Importing data from each of these
data sources is provided by functions with the prefix <code class="docutils literal notranslate"><span class="pre">read_*</span></code>. Similarly, the <code class="docutils literal notranslate"><span class="pre">to_*</span></code> methods are used to store data.</p>

<img alt="../_images/02_io_readwrite.svg" class="align-center" src="https://pandas.pydata.org/pandas-docs/stable/_images/02_io_readwrite.svg">
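<p>As a minimal sketch of the <code>to_*</code> / <code>read_*</code> pairing, the example below writes a small DataFrame to CSV and reads it back. An in-memory <code>io.StringIO</code> buffer stands in for a file so that nothing is written to disk; with a real file, a path such as a hypothetical <code>"people.csv"</code> would be passed instead.</p>

```python
import io

import pandas as pd

original = pd.DataFrame({"Name": ["Simran", "Rohit"], "Age": [30, 40]})

# Write to CSV; index=False leaves the row labels out of the output.
buffer = io.StringIO()
original.to_csv(buffer, index=False)

# Rewind the buffer and read the CSV text back into a DataFrame.
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored)
```

<p>Without <code>index=False</code>, the row labels are written as an extra first column, which is read back as a column named <code>Unnamed: 0</code> unless <code>index_col=0</code> is passed to <code>read_csv</code>.</p>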