In [1]:
import pandas as pd
import numpy as np

There are two main data structures in current pandas:
- Series: 1D
- DataFrame: 2D

A third structure, Panel (3D), existed in older versions but was removed in pandas 0.25.


In [2]:
pd.Series

pandas.core.series.Series

In [3]:
pd.DataFrame

pandas.core.frame.DataFrame

## Series

In [5]:
import pandas as pd
import numpy as np

In [6]:
pd.Series([1, 2, 3, 4, 5])

0    1
1    2
2    3
3    4
4    5
dtype: int64

In [11]:
s = pd.Series(np.arange(5), index=['a', 'b', 'c', 'd', 'e'])
s

a    0
b    1
c    2
d    3
e    4
dtype: int64

In [12]:
s.index

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

**Note**:<br> pandas supports non-unique index values. An operation that does not support duplicate index values raises an exception only when it is actually attempted. This lazy check exists almost entirely for performance: many computations, such as parts of GroupBy, never use the index.

In [3]:
# the index may contain duplicate labels:
s = pd.Series(np.random.randn(5), index=['a', 'b', 'a', 'c', 'a'])
s

a    2.020283
b    1.410510
a    0.435688
c   -0.334158
a    0.125384
dtype: float64

In [4]:
s['a']

a    2.020283
a    0.435688
a    0.125384
dtype: float64
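
A minimal sketch of the lazy duplicate check described in the note above: building a Series with duplicate labels and selecting from it both work, and an error only surfaces in an operation that needs unique labels. Using `reindex` as the failing operation is an illustrative choice, and the name `dup` is hypothetical; neither comes from the original notebook.

```python
import numpy as np
import pandas as pd

dup = pd.Series(np.arange(5), index=['a', 'b', 'a', 'c', 'a'])

# Construction and label selection both succeed with duplicate labels.
print(dup.index.is_unique)
print(len(dup['a']))  # number of rows that share the label 'a'

# reindex needs unique labels, so the error appears only at this point.
try:
    dup.reindex(['a', 'b'])
    reindex_failed = False
except ValueError as exc:
    reindex_failed = True
    print('reindex failed:', exc)
```

Checking `index.is_unique` up front is one way to find out early whether later operations might fail.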

<br> </br>
### From Dict

In [16]:
dic = {
    'a': 1,
    'b': 2,
    'c': 3,
}

pd.Series(dic)

a    1
b    2
c    3
dtype: int64

<br>You can think of a pandas Series as a **Dictionary**, <ins>except</ins> that a Series can have non-unique index labels (keys). <br>
Like dictionary keys, the index labels must still be hashable.

<br>

In [19]:
dic = {
    'a': 1,
    'b': 2,
    'c': 3,
}

index = ['a', 'b', 'c', 'd'] # one more element in here


pd.Series(dic, index=index)

a    1.0
b    2.0
c    3.0
d    NaN
dtype: float64

The label 'd' has no matching key in the dictionary, so its value becomes **NaN** (and the dtype changes to float64).


**Note**: NaN (not a number) is the standard missing data marker used in pandas.
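
As a small sketch of working with the missing-data marker, the example below rebuilds the Series from the previous cell and inspects it with `isna()` and `dropna()`. The variable name `s_missing` is hypothetical.

```python
import pandas as pd

dic = {'a': 1, 'b': 2, 'c': 3}
s_missing = pd.Series(dic, index=['a', 'b', 'c', 'd'])

# isna() marks the missing entries; the NaN forces the dtype to float64.
print(s_missing.isna())
print(s_missing.dtype)

# dropna() returns a new Series without the missing entries.
print(s_missing.dropna())
```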



If you passed a list instead of a dictionary here, and the index were longer than the list, <br>
you would get an error (unlike in the dictionary case above).
<br> <br>
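
The list case can be sketched as follows; the names `values` and `length_error` are hypothetical, and the exact error message may vary between pandas versions.

```python
import pandas as pd

values = [1, 2, 3]
index = ['a', 'b', 'c', 'd']  # one more label than there are values

# A list carries no labels to align on, so the lengths must match exactly.
try:
    pd.Series(values, index=index)
    length_error = None
except ValueError as exc:
    length_error = exc
    print('ValueError:', exc)
```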

### From scalar value



In [20]:
pd.Series(5, [0, 1, 2, 3, 4])

0    5
1    5
2    5
3    5
4    5
dtype: int64

<br><br>

## Series is ndarray-like


In [26]:
s = pd.Series(
    np.random.randint(1, 10, size=4),
    index=['a', 'b', 'c', 'd']
)

s

a    7
b    9
c    1
d    4
dtype: int64

In [27]:
s['a']

7

In [28]:
s[0]  # positional lookup; recent pandas versions deprecate this in favor of s.iloc[0]

7

In [33]:
s[2:]

c    1
d    4
dtype: int64

In [31]:
s[s > s.mean()]

a    7
b    9
dtype: int64

In [39]:
s[[0, 2, 3]]  # important: a list selects several positions at once

a    7
c    1
d    4
dtype: int64
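
The positional and label lookups in this section can also be written explicitly with `.iloc` and `.loc`. This is a sketch with fixed values (the notebook's Series is random) and a hypothetical name `s_demo`; it avoids relying on `[]` guessing whether an integer means a position or a label.

```python
import pandas as pd

s_demo = pd.Series([7, 9, 1, 4], index=['a', 'b', 'c', 'd'])

print(s_demo.iloc[0])          # first element by position
print(s_demo.iloc[[0, 2, 3]])  # several positions at once
print(s_demo.loc['a'])         # the same first element, by label
print(s_demo.loc[['a', 'c']])  # several labels at once
```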

In [41]:
np.log(s)

a    1.945910
b    2.197225
c    0.000000
d    1.386294
dtype: float64

In [42]:
np.mean(s)

5.25

In [43]:
s.dtype

dtype('int64')

<br>
If you need the actual array backing a Series, use Series.array.



In [44]:
pd.array(s)

<PandasArray>
[7, 9, 1, 4]
Length: 4, dtype: int64

In [45]:
s.array

<PandasArray>
[7, 9, 1, 4]
Length: 4, dtype: int64

Series.array will always be an ExtensionArray. Briefly, an ExtensionArray is a thin wrapper around one or more concrete arrays like a numpy.ndarray. pandas knows how to take an ExtensionArray and store it in a Series or a column of a DataFrame. See dtypes for more.

<br>

<br>
While Series is ndarray-like, if you need an actual ndarray, then use Series.to_numpy().

In [46]:
s.to_numpy()

array([7, 9, 1, 4])
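
To make the difference between the two accessors concrete, the sketch below checks the types they return. It uses fixed values and hypothetical names (`s_demo`, `backing`, `plain`); the concrete class behind `Series.array` has been renamed across pandas versions, so only the base classes are checked.

```python
import numpy as np
import pandas as pd

s_demo = pd.Series([7, 9, 1, 4], index=['a', 'b', 'c', 'd'])

backing = s_demo.array     # pandas ExtensionArray wrapper
plain = s_demo.to_numpy()  # plain NumPy array

print(isinstance(backing, pd.api.extensions.ExtensionArray))
print(isinstance(plain, np.ndarray))
print(type(backing).__name__)  # class name depends on the pandas version
```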

<br> <br>

## Series is dict-like

In [47]:
s

a    7
b    9
c    1
d    4
dtype: int64

In [48]:
s['a']

7

In [49]:
s.index

Index(['a', 'b', 'c', 'd'], dtype='object')

In [52]:
s.values

array([7, 9, 1, 4])

In [53]:
s['a'] = 0
s

a    0
b    9
c    1
d    4
dtype: int64

In [54]:
'b' in s

True

If a label is not contained in the index, indexing with `[]` raises a `KeyError`:

In [57]:
# s['q']  # would raise KeyError: 'q'

Using the `get` method, a missing label returns None or a specified default instead:

In [59]:
print(s.get('q'))

None


In [60]:
print(s.get('a'))

0


In [62]:
s.get('q', default='not-found') # just like dict

'not-found'
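
A short sketch contrasting the two lookup styles side by side, with fixed values and hypothetical names (`s_demo`, `missing_error`): `[]` raises for a missing label, while `get` falls back quietly.

```python
import pandas as pd

s_demo = pd.Series([0, 9, 1, 4], index=['a', 'b', 'c', 'd'])

# Bracket lookup raises KeyError for a label that is not in the index.
try:
    s_demo['q']
    missing_error = None
except KeyError as exc:
    missing_error = exc
    print('KeyError:', exc)

# get() returns None, or the given default, instead of raising.
print(s_demo.get('q'))
print(s_demo.get('q', default=-1))
```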

## Vectorized operations and label alignment with Series


In [64]:
s

a    0
b    9
c    1
d    4
dtype: int64

In [65]:
s + s

a     0
b    18
c     2
d     8
dtype: int64

In [66]:
s * 2

a     0
b    18
c     2
d     8
dtype: int64

<br>

A key difference between Series and ndarray is that operations between Series **automatically align the data based on label**. Thus, you can write computations without giving consideration to whether the Series involved have the same labels.

<br></br>
This concept is really important: pandas aligns the data based on its **index labels**, not on position.

In [67]:
s[1:]

b    9
c    1
d    4
dtype: int64

In [68]:
s[:-1]

a    0
b    9
c    1
dtype: int64

In [69]:
s[1:] + s[:-1]

a     NaN
b    18.0
c     2.0
d     NaN
dtype: float64

`s[1:]` has no label 'a'<br>
`s[:-1]` has no label 'd'<br>
so the result is NaN for both labels, and the dtype becomes float64
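
If the NaN results are not wanted, one possible approach (an assumption about what the reader may need, not something shown in the original notebook) is `Series.add` with `fill_value`, or dropping the unmatched labels afterwards. The names `s_demo`, `left`, `right`, and `filled` are hypothetical.

```python
import pandas as pd

s_demo = pd.Series([0, 9, 1, 4], index=['a', 'b', 'c', 'd'])
left = s_demo.iloc[1:]    # labels b, c, d
right = s_demo.iloc[:-1]  # labels a, b, c

# Plain addition: a label present on only one side gives NaN.
summed = left + right
print(summed)

# add() with fill_value=0 treats a label missing on one side as 0.
filled = left.add(right, fill_value=0)
print(filled)

# Alternatively, drop the unmatched labels after the fact.
print(summed.dropna())
```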

## Name attribute


In [8]:
s = pd.Series(np.random.randn(5), name='serie')
s

0   -0.090041
1    2.658800
2   -0.603770
3   -1.203347
4    0.090409
Name: serie, dtype: float64

In [9]:
s.name

'serie'

In [10]:
s.name = 'serie2'
s.name

'serie2'

In [11]:
s2 = s.rename('serie3')
s2.name

'serie3'

Note that s and s2 refer to different objects.
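
A quick way to check this, sketched with hypothetical names `s_named` and `s_renamed`: `rename` returns a new Series and leaves the name of the original untouched.

```python
import pandas as pd

s_named = pd.Series([1.0, 2.0, 3.0], name='serie2')
s_renamed = s_named.rename('serie3')

print(s_renamed is s_named)  # identity check: are they the same object?
print(s_named.name)
print(s_renamed.name)
```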



<br>
<br>

### Copy and View

In [84]:
a = np.random.randn(4, 4)
b = pd.DataFrame(a)
b

          0         1         2         3
0  0.399534  0.630542  0.913081  0.203280
1  0.205706  0.834606 -1.246822 -0.607256
2 -1.746456 -0.419059  0.253390  0.601311
3 -0.864971  1.449097  0.651267  0.645717


In [85]:
a[0] = 0
a

array([[ 0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.20570585,  0.83460556, -1.24682201, -0.60725625],
       [-1.74645592, -0.41905922,  0.25339009,  0.60131108],
       [-0.86497122,  1.44909717,  0.65126742,  0.64571682]])

In [86]:
b

          0         1         2         3
0  0.000000  0.000000  0.000000  0.000000
1  0.205706  0.834606 -1.246822 -0.607256
2 -1.746456 -0.419059  0.253390  0.601311
3 -0.864971  1.449097  0.651267  0.645717


As you can see, the first row of b has also changed. Why? In this run, `b` was built without copying the array, so it shares memory with `a`.


In [3]:
a = np.random.randn(4, 4)
b = pd.DataFrame(a, copy=True)
b

          0         1         2         3
0 -0.160965  0.267512 -1.645385  0.169228
1  0.181696  0.854095  1.526907 -0.599061
2 -1.669549 -0.255968  0.238605 -1.005476
3  2.275539 -1.406166 -0.587911 -1.508659


In [4]:
a[0] = 0
a

array([[ 0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.18169611,  0.85409496,  1.52690695, -0.59906058],
       [-1.66954862, -0.25596821,  0.23860472, -1.00547639],
       [ 2.27553936, -1.40616592, -0.58791133, -1.50865927]])

In [5]:
b

          0         1         2         3
0 -0.160965  0.267512 -1.645385  0.169228
1  0.181696  0.854095  1.526907 -0.599061
2 -1.669549 -0.255968  0.238605 -1.005476
3  2.275539 -1.406166 -0.587911 -1.508659


Here the first row has not changed, because `copy=True` gave `b` its own copy of the data.

___

In [13]:
a = [1, 2, 3, 4]
b = pd.DataFrame(a, copy=False)
b

   0
0  1
1  2
2  3
3  4


In [14]:
a[:] = [0, 0, 0, 0]  # modify the list in place; rebinding the name a would test nothing
b

   0
0  1
1  2
2  3
3  4


If you pass the DataFrame constructor one of the following:
1) ndarray
2) Series

then `copy=False` gives you a **view**: the DataFrame shares memory with the input.

Any other input, such as a list, is copied even if you pass `copy=False`.
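
One way to check whether a DataFrame shares memory with its source is `np.shares_memory`. The sketch below uses hypothetical names (`arr`, `copied`, `maybe_view`, `items`, `from_list`). The `copy=False` result is printed rather than assumed, because it can depend on the pandas version and its copy-on-write settings.

```python
import numpy as np
import pandas as pd

arr = np.zeros((3, 2))

copied = pd.DataFrame(arr, copy=True)
maybe_view = pd.DataFrame(arr, copy=False)

# copy=True never shares memory with the source array.
print(np.shares_memory(arr, copied.to_numpy()))

# copy=False may share memory; the result depends on the pandas version.
print(np.shares_memory(arr, maybe_view.to_numpy()))

# A list is converted to a new array, so later changes to the list
# do not reach the DataFrame, even with copy=False.
items = [1, 2, 3, 4]
from_list = pd.DataFrame(items, copy=False)
items[0] = 100
print(from_list[0].tolist())
```

When independence from the source data matters, passing `copy=True` explicitly avoids relying on version-specific defaults.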