[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](http://colab.research.google.com/github/drillan/python-data-analysis/blob/main/docs/pandas/text.ipynb)

# Processing text data

This section shows how to process text data with pandas.

As sample data, we load the module index of the [official Python documentation](https://docs.python.org/) into a DataFrame.

In [1]:
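import pandas as pd

df = (
    # read_html returns a list of DataFrames; take the first table on the page
    pd.read_html("https://docs.python.org/3/py-modindex.html")[0]
    .drop(0, axis=1)  # drop the first column, which is not used here
    .rename({1: "module", 2: "description"}, axis=1)
    .dropna()  # remove rows that contain missing values
)
df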

Unnamed: 0,module,description
2,__future__,Future statement definitions
3,__main__,The environment where top-level code is run. C...
4,_thread,Low-level threading API.
7,abc,Abstract base classes according to :pep:`3119`.
8,aifc,Deprecated: Read and write audio files in AIF...
...,...,...
387,zipapp,Manage executable Python zip archives
388,zipfile,Read and write ZIP-format archive files.
389,zipimport,Support for importing Python modules from ZIP ...
390,zlib,Low-level interface to compression and decompr...


## The .str accessor

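A Series provides the ".str accessor", which operates on the string held in each element. Through the .str accessor you can call methods equivalent to those of Python's built-in str type.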

> - [Working with text data](https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html#method-summary)
> - [String handling](https://pandas.pydata.org/pandas-docs/stable/reference/series.html#string-handling)

In [2]:
df.loc[:, "description"].str

<pandas.core.strings.accessor.StringMethods at 0x7fe307ee3350>
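As a minimal sketch of what "equivalent methods" means in practice, the following applies two .str methods element by element. The Series `s` and its values are hypothetical sample data, not taken from the module index.

```python
import pandas as pd

# Hypothetical sample data (not from the module index above)
s = pd.Series(["abc", "Zipfile"])

# Each .str method is applied to every element and returns a new Series
upper = s.str.upper()  # counterpart of str.upper()
lengths = s.str.len()  # counterpart of len() on each string

print(upper.tolist())
print(lengths.tolist())
```

The original Series is left unchanged; each call returns a new Series with the same index.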

Consider extracting the rows whose "description" column starts with `Deprecated:`. The .str accessor exposes a method equivalent to the [startswith](https://docs.python.org/ja/3/library/stdtypes.html#str.startswith) method of Python's str type.

In [3]:
df.loc[:, "description"].str.startswith("Deprecated:")

2      False
3      False
4      False
7      False
8       True
       ...  
387    False
388    False
389    False
390    False
391    False
Name: description, Length: 331, dtype: bool

The [.str.startswith](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.startswith.html) method returns a boolean Series, so passing it to the .loc indexer extracts the rows where the value is `True`.

In [4]:
df.loc[df.loc[:, "description"].str.startswith("Deprecated:"), :]

Unnamed: 0,module,description
8,aifc,Deprecated: Read and write audio files in AIF...
12,asynchat,Deprecated: Support for asynchronous command/...
14,asyncore,Deprecated: A base class for developing async...
16,audioop,Deprecated: Manipulate raw audio data.
28,cgi,Deprecated: Helpers for running Python script...
29,cgitb,Deprecated: Configurable traceback handler fo...
30,chunk,Deprecated: Module to read IFF chunks.
48,crypt (Unix),Deprecated: The crypt() function used to chec...
171,imghdr,Deprecated: Determine the type of image conta...
172,imp,Deprecated: Access the implementation of the ...
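The `df` used here has already been through `dropna()`, so the mask contains only `True` and `False`. If missing values remained in the column, the mask would contain missing entries as well, and `.loc` would not accept it. The sketch below, which uses a small hypothetical DataFrame named `toy`, shows one way to handle that case with the `na` parameter of `.str.startswith`.

```python
import pandas as pd

# Hypothetical data with one missing description
toy = pd.DataFrame(
    {
        "module": ["aifc", "abc", "mystery"],
        "description": [
            "Deprecated: Read and write audio files",
            "Abstract base classes",
            None,
        ],
    }
)

# na=False makes missing values count as "does not start with the prefix"
mask = toy.loc[:, "description"].str.startswith("Deprecated:", na=False)
deprecated = toy.loc[mask, :]

print(deprecated)
```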


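Next, consider removing the string `Deprecated: ` from the "description" column. Pass the string to be replaced as the first argument of the [.str.replace](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.replace.html) method and the replacement string as the second argument. Here `regex=False` is passed so that the first argument is treated as a literal string rather than as a regular expression pattern.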

In [5]:
df.loc[:, "description"].str.replace("Deprecated: ", "", regex=False)

2                           Future statement definitions
3      The environment where top-level code is run. C...
4                               Low-level threading API.
7        Abstract base classes according to :pep:`3119`.
8       Read and write audio files in AIFF or AIFC fo...
                             ...                        
387                Manage executable Python zip archives
388             Read and write ZIP-format archive files.
389    Support for importing Python modules from ZIP ...
390    Low-level interface to compression and decompr...
391                               IANA time zone support
Name: description, Length: 331, dtype: object
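The `regex` argument matters as soon as the search string contains regular expression metacharacters such as parentheses. The following sketch compares the two modes on hypothetical values modeled on the "module" column; the variable names are illustrative only.

```python
import pandas as pd

# Hypothetical values modeled on the "module" column
s = pd.Series(["crypt (Unix)", "msvcrt (Windows)"])

# Literal replacement: the parentheses are ordinary characters
literal = s.str.replace(" (Unix)", "", regex=False)

# Same string as a regex: "(Unix)" becomes a group, so the pattern looks for
# " Unix", which does not occur, and nothing is replaced
unescaped = s.str.replace(" (Unix)", "", regex=True)

# An escaped regex that removes any trailing parenthesized suffix
pattern = s.str.replace(r" \(.*\)$", "", regex=True)

print(literal.tolist())
print(unescaped.tolist())
print(pattern.tolist())
```

Passing `regex` explicitly keeps the intent clear to readers and avoids depending on the default, which has changed between pandas versions.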

The [.str.split](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.split.html?highlight=str%20split#pandas.Series.str.split) method splits each string into a list. If no argument is given, the strings are split on whitespace.

In [6]:
df.loc[:, "description"].str.split()

2                       [Future, statement, definitions]
3      [The, environment, where, top-level, code, is,...
4                           [Low-level, threading, API.]
7      [Abstract, base, classes, according, to, :pep:...
8      [Deprecated:, Read, and, write, audio, files, ...
                             ...                        
387          [Manage, executable, Python, zip, archives]
388      [Read, and, write, ZIP-format, archive, files.]
389    [Support, for, importing, Python, modules, fro...
390    [Low-level, interface, to, compression, and, d...
391                          [IANA, time, zone, support]
Name: description, Length: 331, dtype: object
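A separator and a maximum number of splits can also be given. The sketch below uses hypothetical descriptions and splits on `": "` at most once, so the `Deprecated` label is separated from the rest of the text.

```python
import pandas as pd

# Hypothetical descriptions modeled on the "description" column
s = pd.Series(
    [
        "Deprecated: Read and write audio files",
        "Low-level threading API.",
    ]
)

# Split on ": " at most once (n=1); strings without the separator
# end up as a one-element list
parts = s.str.split(": ", n=1)

print(parts.tolist())
```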

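Passing `True` to the `expand` argument expands the split strings into columns. The result is a DataFrame in which rows with fewer pieces are padded with missing values.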

In [7]:
df.loc[:, "description"].str.split(expand=True)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
2,Future,statement,definitions,,,,,,,,,,,,,,
3,The,environment,where,top-level,code,is,run.,Covers,command-line,"interfaces,",import-time,"behavior,",and,``__name__,==,'__main__'``.,
4,Low-level,threading,API.,,,,,,,,,,,,,,
7,Abstract,base,classes,according,to,:pep:`3119`.,,,,,,,,,,,
8,Deprecated:,Read,and,write,audio,files,in,AIFF,or,AIFC,format.,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
387,Manage,executable,Python,zip,archives,,,,,,,,,,,,
388,Read,and,write,ZIP-format,archive,files.,,,,,,,,,,,
389,Support,for,importing,Python,modules,from,ZIP,archives.,,,,,,,,,
390,Low-level,interface,to,compression,and,decompression,routines,compatible,with,gzip.,,,,,,,
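Combining `expand=True` with `n` keeps the number of columns predictable. The following sketch splits hypothetical module names into a name and an optional platform suffix; the column names `name` and `platform` are chosen here for illustration.

```python
import pandas as pd

# Hypothetical values modeled on the "module" column
s = pd.Series(["crypt (Unix)", "zlib"])

# Split on the first space only, so at most two columns are produced
wide = s.str.split(" ", n=1, expand=True)
wide.columns = ["name", "platform"]  # illustrative column names

print(wide)
```

Rows without a space, such as `zlib`, get a missing value in the second column.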


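The .str accessor can be indexed with square brackets, which lets you use slice notation. The following code takes the first three characters of the "module" column.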

In [8]:
df.loc[:, "module"].str[:3]

2      __f
3      __m
4      _th
7      abc
8      aif
      ... 
387    zip
388    zip
389    zip
390    zli
391    zon
Name: module, Length: 331, dtype: object
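Single positions and negative indexes work the same way. The sketch below uses hypothetical module names and also shows what happens when a string is shorter than the requested position.

```python
import pandas as pd

# Hypothetical module names
s = pd.Series(["zipapp", "zlib", "os"])

first3 = s.str[:3]  # slice: up to the first three characters
last = s.str[-1]    # single index: the last character
third = s.str[2]    # out of range for "os", which yields a missing value

print(first3.tolist())
print(last.tolist())
print(third.tolist())
```

Unlike indexing a plain Python str, an out-of-range position does not raise an error here; the element becomes a missing value instead.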

### Exercise 1

From the `df` object, extract the rows whose "module" column contains `"(Windows)"`.

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>module</th>
      <th>description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>209</th>
      <td>msilib (Windows)</td>
      <td>Deprecated:  Creation of Microsoft Installer files, and CAB files.</td>
    </tr>
    <tr>
      <th>210</th>
      <td>msvcrt (Windows)</td>
      <td>Miscellaneous useful routines from the MS VC++ runtime.</td>
    </tr>
    <tr>
      <th>358</th>
      <td>winreg (Windows)</td>
      <td>Routines and objects for manipulating the Windows registry.</td>
    </tr>
    <tr>
      <th>359</th>
      <td>winsound (Windows)</td>
      <td>Access to the sound-playing machinery for Windows.</td>
    </tr>
  </tbody>
</table>

In [9]:
# Answer cell